Analysis of objects of interest in sensor data using deep neural networks

ABSTRACT

Sensor data captured by one or more sensors may be received at an analysis system. A neural network may be used to detect an object in the sensor data. A plurality of polygons surrounding the object may be generated in one or more subsets of the sensor data. A prediction of a future position of the object may be generated based at least in part on the polygons. One or more commands may be provided to a control system based on the prediction of the future position.

This application is a continuation of U.S. patent application Ser. No. 15/606,875, filed May 26, 2017, which claims priority to U.S. Provisional Patent Application Ser. No. 62/343,071, filed May 30, 2016; U.S. Provisional Patent Application Ser. No. 62/343,077, filed May 30, 2016; U.S. Provisional Patent Application Ser. No. 62/343,080, filed May 30, 2016; and U.S. Provisional Patent Application Ser. No. 62/343,082, filed May 30, 2016, which are hereby incorporated by reference herein in their entirety.

TECHNICAL FIELD

This disclosure relates to computer systems for autonomous analysis of sensor data.

DESCRIPTION OF THE RELATED ART

Sensor data analysis may be required for a variety of applications. For example, given the challenges and continuous threats to security in modern urban and suburban environments, video-based surveillance is vital to obtain evidence and could be used to facilitate real-time, on-the-fly responses to emergency events. In 2014, over 245 million video surveillance cameras were installed globally. Sensor data obtained from a variety of sensor types, including video and LIDAR (light detection and ranging) sensors, may have to be analyzed for making decisions regarding future movements of autonomous vehicles, which are increasingly a focus of research and development.

Depending on the particular application, a variety of types of objects may be of interest. For example, in the video-based security environment, individuals who may be performing potentially harmful activities may be of interest, while in the environment of an autonomous vehicle, other vehicles, pedestrians, road signs and the like may be of interest. Identifying, tracking and predicting future states of objects of interest in a variety of application domains using sensor data remains a challenging technical problem.

SUMMARY OF EMBODIMENTS

According to one embodiment, a system may comprise one or more processors and an associated memory. The memory may store a neural network. The neural network may be configured to receive a representation of sensor data comprising image frames, captured by one or more sensors such as cameras. The neural network may detect an object in the image frames and generate a plurality of polygons surrounding or enclosing the object in various image frames. A prediction of a future position of the object may be generated by the neural network based at least in part on the polygons. The one or more processors may provide one or more commands to a control system (such as a motion control system of a vehicle, or a camera control system which can pan, zoom or otherwise change the state of a camera) based at least in part on the predicted future position.

In at least one embodiment, the neural network may generate a heat map of an image frame. The heat map may comprise a plurality of pixels, with individual ones of the pixels indicating a respective value representing a probability or likelihood that at least a portion of a detected object is located at the pixel. The heat map and a post-processed image corresponding to the image frame may be gated to remove one or more areas of the post-processed image that do not contain the object, producing a gated image in various embodiments. The post-processed image may have been obtained by performing one or more transformations on the image frame, such as reducing the number of color channels, cross-correlation operations and the like in various embodiments.

In some embodiments, to generate the polygons, the neural network may generate a centroid and a set of vertices of a convex polygon from the gated image using a recurrent portion of the neural network. In one embodiment, to generate the prediction of the future position of the object, the neural network may be configured to obtain respective centroids and sets of vertices for individual ones of the plurality of polygons, and determine a position of a future polygon in a future image frame, based at least in part on the respective centroids and sets of vertices.

A variety of sensors and associated computing devices and control systems may be employed in different embodiments. For example, in one embodiment the sensors may include a video camera, the control system may comprise a video camera controller, and the commands provided by the processors may instruct the video camera to move or change a zoom setting to focus or maintain attention on the object. In another embodiment, at least some of the sensors (such as LIDAR devices) may be located on or incorporated within a vehicle, the control system may comprise a motion control subsystem of the vehicle, and the commands may comprise motion control directives (such as to accelerate or decelerate the vehicle).

In some embodiments, an object-of-interest database may be accessible at the system comprising the neural network. An object-of-interest database may also be referred to in some embodiments as a region-of-interest database. Using such a database, an object type (e.g., a vehicle, a pedestrian, a road sign, or the like) of the object detected using the sensor data may be identified in various embodiments. Based at least in part on the object type, an object analysis technique may be selected and used to monitor respective portions of image frames that lie within respective polygons. The results of the monitoring may be used to detect a state change of the object, and the commands provided by the processors may be based at least partly on the state change.

The neural network may generate predictions of respective future movements of a plurality of objects detected in the image frames in some embodiments. A movement plan for a vehicle, whose execution would result in moving the vehicle from a first position to a second position relative to the plurality of objects, may be determined using the predictions of future movements. Commands provided by the processors to the control systems may be based at least in part on the movement plan.

According to one embodiment, a method may comprise receiving sensor data comprising one or more image frames from one or more sensors. The method may include using a neural network to detect an object in the image frames and generating a plurality of polygons surrounding the object in individual ones of the image frames. A future position of the object may be predicted using the polygons, and one or more commands may be provided to a control system based on the predicted future position.

In one embodiment, a non-transitory computer-accessible storage medium may store program instructions that when executed on one or more processors cause the one or more processors to receive sensor data captured by one or more sensors. The instructions when executed may cause the one or more processors to utilize a neural network to detect an object in the sensor data, generate a plurality of polygons surrounding the object in one or more subsets of the sensor data, and generate a prediction of a future position of the object based at least in part on the plurality of polygons. The instructions when executed may further cause the one or more processors to provide one or more commands to a control system based at least in part on the prediction of the future position of the object.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A illustrates an example computing system in which objects of interest in sensor data may be analyzed using deep neural networks, according to some embodiments.

FIG. 1B illustrates an example computer system that may be used to implement one or more elements of a sensor data analysis system.

FIG. 1C illustrates an example high level workflow of the processing of video data during analysis, according to some embodiments.

FIG. 2 illustrates an example flow of processing for detecting a region of interest in video data, according to some embodiments.

FIG. 3 illustrates aspects of an annealed rectified linear unit which may be utilized for processing sensor data, according to some embodiments.

FIG. 4 is a flow diagram illustrating aspects of operations which may be performed to train a deep neural network to detect regions of interest using annealed rectified linear activations, according to some embodiments.

FIG. 5A depicts an example flow of processing for predicting polygons in video data, according to some embodiments.

FIG. 5B depicts an example flow of processing for using recurrent neural networks to detect polygons, according to some embodiments.

FIG. 5C is a flow diagram illustrating aspects of operations of a recurrent neural network used for polygon generation, according to some embodiments.

FIG. 6 is a schematic diagram of a Long-Short Term Memory unit that may be used in polygon generation, according to some embodiments.

FIG. 7 illustrates an example flow of processing for polygon temporal tracking and future prediction, according to some embodiments.

FIG. 8 illustrates an example flow of processing for storing and retrieving data to and from a region of interest (ROI) identity database, according to some embodiments.

FIGS. 9A and 9B illustrate examples of scheduling the movements of a camera based on detected objects and object velocities, according to some embodiments.

FIG. 10A illustrates an example flow of processing for determining attention focus of one or more cameras, according to some embodiments.

FIG. 10B is a flow diagram illustrating aspects of operations which may be performed with respect to attentional planning for a camera, according to some embodiments.

FIG. 10C is a flow diagram illustrating aspects of operations for controlling an attention-focusing camera, according to some embodiments.

FIG. 11 illustrates an example search tree that may be used to plan actions of an attention-focusing camera, according to some embodiments.

FIG. 12 illustrates an example computer system of a movable device at which sensor data may be analyzed using neural networks, according to some embodiments.

FIG. 13 is a flow diagram illustrating aspects of operations for controlling the movements of a movable device, according to some embodiments.

FIG. 14 is a flow diagram illustrating aspects of operations for movement planning at a movable device using an attention focusing system, according to some embodiments.

While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.

DETAILED DESCRIPTION

It will be appreciated that for simplicity and clarity of illustration, in some cases, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, some details or features are set forth to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein are illustrative examples that may be practiced without these details or features. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the concepts illustrated in the examples described herein. Also, the description is not to be considered as limiting the scope of the example embodiments described herein or illustrated in the drawings. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. It is noted that the terms “neural network based model”, “neural network model” and “neural network” may be used synonymously with respect to various embodiments to refer to a machine learning model that includes or utilizes at least one network of artificial neurons. The term “deep neural network” (DNN) may be used in various embodiments to refer to a neural network which comprises a plurality of layers of artificial neurons.

In some embodiments, the techniques described herein generally with respect to sensor data may be applied at least in part to video data, which may for example be captured from surveillance cameras, autonomous vehicles, and the like. In applications using recorded video from surveillance cameras for evidence collection or post-analysis, several aspects of the analysis may be considered important. Firstly, to the extent possible, the camera should readily capture important and relevant data when an abnormal event is triggered, irrespective of the time of day at which the event takes place. Secondly, to the extent possible, the resolution and quality of the captured data must be sufficiently high in order to be of use to investigators after the fact. Due to storage and other hardware costs, many security cameras may not be high-definition, but rather may capture video in low resolution, which may lose fine details. This may for example lead to a problematic situation where a suspect/perpetrator's face is blurred, or the details of an object of interest (e.g. a license plate) are too grainy to be usable for later investigations. Accordingly, in at least some embodiments, an intelligent system for sensor data analysis may utilize deep neural networks to dynamically focus on the objects-of-interest at the appropriate times by controlling a dynamic camera with pan, tilt and optical zoom capabilities. Such an intelligent system may be referred to in some embodiments as an attention-focusing system.

In at least one embodiment, sensor data collected using various types of sensors of an autonomous or semi-autonomous vehicle may be analyzed using deep neural networks to enable decisions regarding the movements of the vehicle to be made. Objects of interest in the operating environment of the vehicle (such as a road on which the vehicle is moving, other vehicles in the vicinity, pedestrians, etc.) may be identified and tracked over successive frames of video or other sensor data in various embodiments. Future movements of at least some of the objects may be predicted using the results of the tracking. An on-board computer system of the vehicle may then use such information to make a variety of decisions, including planning the movements of the vehicle, issuing low-level motion directives to motion controllers, and so on.

Some vehicles may include a number of video cameras that can be used to monitor the vehicle's surroundings on an ongoing basis in various embodiments. However, at least during some intervals of time, the captured video data may not include anything of interest that requires further analysis. In such a setting, the vehicle may monitor the captured video data to check for objects of interest. If an object of interest is detected, the vehicle's computer system may be alerted to focus on that object to perform a more detailed analysis. For example, in one embodiment the vehicle may detect that a video frame indicates a traffic sign showing a speed limit. The vehicle may track the traffic sign over successive video frames, and apply text recognition to the portions of the video frames to determine the indicated speed limit in at least some embodiments. In such embodiments, the vehicle's computer system may be relieved from constantly performing text recognition on the entirety of every video frame to recognize speed limits.

In various embodiments, one or more neural network based models may be used to obtain unique representations for each object-of-interest (e.g. a person, a person's face, a car, a road sign, or any other object-of-interest). Furthermore, in at least some embodiments, a high-resolution image or video of the objects may be obtained using automated control mechanisms. Such high-resolution images or videos of the object-of-interest may for example be recorded for future use, forwarded to a more specialized analysis system for further analysis, or provided to a decision-making system that is responsible for controlling other systems, such as the motion control subsystem for an autonomous vehicle.

In some embodiments, a system for sensor data analysis may be organized as a collection of modules, at least some of which may be designed, developed and/or debugged relatively independently. For example, in one embodiment such a system may comprise at least five modules designated as Modules I through V, each implemented using some combination of hardware and software. Modules I and II may each comprise a Multilayer Cross-Correlation Deep Neural Network (MCC-DNN) in some embodiments. Module III may be responsible for computing the velocity of, and predicting the locations of, each of one or more polygon ROIs (regions of interest) at one or more future points in time. Module IV may be responsible for encoding and storage of the identity of each polygon of interest into a database called an object-of-interest identity database. Module V may comprise an attentional planner which aggregates the temporal predictions of Module III with the semantic feature information from Module IV to compute an optimal set of motor actions for one or more controlled devices. For example, in one embodiment, motor actions of a camera may be controlled such that high-resolution images would be obtained for as many polygon ROIs as possible. In some embodiments, Module V may generate one or more directives to control the operations of cameras located on a vehicle, and/or the movements of the vehicle itself. A brief summary of some example aspects of the modules is provided below with respect to at least some embodiments.

Module I: Region of Interest (ROI) Object Detection Module. This module may, for example, take an image frame (e.g. from video data) as its input and perform a parallel Region of Interest (ROI) detection (e.g. to detect persons' faces, license plates, bags, an object being held, etc.) based on a Multi-layer Cross-Correlation Deep Neural Network (MCC-DNN) in some embodiments. The output of this module may comprise a heat map in various embodiments, such as a (w×h) sized single-channel image with each pixel taking on values between 0.0 and 1.0, representing the probability or likelihood of at least a portion of an object of interest being present at the location corresponding to the pixel.

Module II: Polygon Prediction via Gated Feedback. This module may, for example, combine the output of Module I, e.g. the heat map, with an input frame by gating the activations of the lower level feature maps, leading to polygon detection. Instead of, for example, detecting rectangular bounding boxes, the module may detect polygons with more than four sides, providing a more accurate description of the shape of the object of interest than a rectangular box in at least some embodiments. Modules I and II combined may comprise a Multilayer Cross-Correlation Deep Neural Network (MCC-DNN) in some embodiments.

Module III: Polygon ROI Temporal Prediction. This module may, for example, be responsible for predicting the future position and scale of various detected polygons, based for example on the previous motion and dynamics of the polygons. In one example implementation, a recurrent neural network (RNN) with a temporal frequency of 2 Hz may be used for prediction, meaning that successive time steps would be separated by 500 milliseconds. Other frequencies may be used in other implementations. In some embodiments the input to Module III (which may comprise a recurrent neural network (RNN)) may include the polygon vertices at each time step, with respect to a global image resolution. The RNN may include multiple hidden layers with LSTM (Long-Short Term Memory) units in at least one embodiment, with connections from one hidden layer to the next in time. The output of the RNN at various time steps may, for example, comprise the position (expressed using x, y coordinates) of a given polygon and its size or area at each of the time steps. Given a starting condition in the form of a detected polygon ROI, this temporal RNN may be used to forward propagate N polygon ROIs in the scene in various embodiments.
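
For illustration only, the following sketch shows one way such a temporal predictor could be structured, assuming a PyTorch-style LSTM. The layer sizes, the simplified per-step input (a centroid and an area rather than the full vertex list), and the class and method names are assumptions made for this example and are not part of the described system.

import torch
import torch.nn as nn

class PolygonTrajectoryRNN(nn.Module):
    """Illustrative sketch of a Module III-style temporal predictor: an LSTM
    over per-step polygon descriptors (centroid x, y and area) that emits a
    predicted position and size for the next 500 ms time step (2 Hz)."""

    def __init__(self, input_size: int = 3, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        self.head = nn.Linear(hidden_size, input_size)  # predicted (x, y, area)

    def forward(self, polygon_sequence: torch.Tensor) -> torch.Tensor:
        # polygon_sequence: (time_steps, batch, 3), sampled at 2 Hz
        hidden_states, _ = self.lstm(polygon_sequence)
        return self.head(hidden_states[-1])  # prediction for the next time step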

Module IV: Object-of-Interest Identity Database. In some embodiments, in order to zoom into high-resolution data capture mode for a unique object-of-interest once, for example when the object-of-interest first appears within the camera's viewing area, a sensor data analysis system may assign a unique identifier to each region-of-interest. For example, a feature vector representing a face may represent a unique identifier, or a feature vector representing a vehicle may be considered a unique identifier. To convert a polygon ROI into a unique feature vector, in some embodiments a rectangle which encloses the polygon may be determined. This rectangle may, for example, be used to crop the image to be fed into a separate MCC-DNN with an input size of, for example, 128×128. Such an MCC-DNN may be trained in an unsupervised manner to reproduce its input, e.g., using an encoding-decoding neural network architecture. Such an MCC-DNN may generate respective feature vectors that represent each unique object-of-interest in at least some embodiments.
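
As an illustration of the cropping step described above, the sketch below encloses a polygon ROI in an axis-aligned rectangle and resizes the crop to 128×128 before it would be fed to the identity encoder. The nearest-neighbor resizing and the function name are assumptions made to keep the example self-contained.

import numpy as np

def polygon_to_identity_crop(image: np.ndarray, vertices: np.ndarray, size: int = 128) -> np.ndarray:
    """Crop the axis-aligned rectangle enclosing a polygon ROI and resize it
    to size x size (nearest-neighbor resizing is an assumption of this sketch).
    `image` is (h, w) or (h, w, channels); `vertices` is (n, 2) as (x, y)."""
    xs, ys = vertices[:, 0], vertices[:, 1]
    x0, x1 = int(np.floor(xs.min())), int(np.ceil(xs.max()))
    y0, y1 = int(np.floor(ys.min())), int(np.ceil(ys.max()))
    crop = image[y0:y1, x0:x1]                               # enclosing rectangle
    rows = np.linspace(0, crop.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size).astype(int)
    return crop[np.ix_(rows, cols)]                          # size x size crop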

Module V: Autonomous High Resolution Attention Focus. In some embodiments, based at least in part on a database of objects or persons of interest, this module may optimize a dynamic temporal scheduler to focus its attention on particular objects, e.g., using pan, tilt and optical zoom controllers. Module V may comprise an attention planner and an attention controller in various embodiments. It is noted that the terms “attention planner” and “attentional planner” may be used synonymously herein, and the terms “attention controller” and “attentional controller” may also be used synonymously. In some embodiments, the attention planner may issue control directives to various controllers on a vehicle, including for example a video camera controller for the vehicle's cameras. The attention planner may analyze the detected objects-of-interest and make predictions regarding their movement relative to the vehicle in some embodiments and, based on this information, generate control commands to, for example, change the orientation and zoom level of a vehicle camera to focus on the objects.

In some embodiments, as mentioned above, the neural network based sensor data analysis system comprising some combination of Modules I-V may be implemented as part of an autonomous vehicle. The autonomous vehicle may have one or more cameras that capture images of the surrounding environment as the vehicle moves. In such embodiments, the attention planner and attention controller may be implemented on an onboard computer of the vehicle, and may comprise a motion selection module that sends motion directives to a motion control subsystem of the vehicle at various intervals. For example, motion directives may include instructions to turn, accelerate, decelerate, brake, etc. Based on the detected objects, the polygons, and their predicted motions relative to the vehicle (which may itself be moving), the vehicle's computer may make various types of movement decisions.

FIG. 1A illustrates an example computing system in which objects of interest in sensor data may be analyzed using deep neural networks, according to some embodiments. As shown, system 100 may include one or more camera devices 101, one or more camera controller devices 102, one or more user computing devices 103, and one or more server machines 104 in the depicted embodiment. These devices or computing systems may be able to communicate with each other over a data network 105. The data network may include different types of data connections, for example, including both wired and wireless connections.

An example embodiment of a camera device 101 may include one or more image sensors (e.g. charge-coupled device (CCD) sensors, complementary metal-oxide semiconductor (CMOS) sensors, etc.) 106, one or more processors 107, and one or more graphics processing units (GPUs) 108. The camera device may also include electromechanical actuators 109 to control the camera's pan, tilt and zoom in some embodiments. In another example embodiment, the pan, tilt and zoom control may be implemented at least in part via software. The camera device 101 may also include one or more memory devices 109 and one or more network communication devices 114 to communicate with the network 105 in some embodiments. The memory device 109 stores, for example, a ROI object detection module 110 (also called Module I), a polygon prediction module 111 (also called Module II), a polygon ROI temporal prediction module 112 (also called Module III), and video data 113 collected by the subject camera in the depicted embodiment.

In some embodiments a camera, including the image sensor or sensors and the actuators, may exist as a separate device, and another computing device in data communication with the camera may include the one or more processors, the one or more GPUs, the one or more memory devices and the one or more network communication devices. In some embodiments, hundreds, thousands, or millions of camera devices may be in data communication with one or more server machines 104 via the network 105. In various embodiments, camera devices 101 may be stationary or moving (e.g. mounted on a vehicle, mounted on a moving platform, etc.). It will be appreciated that any of a variety of camera types may be used in different embodiments, including, for example, webcams, Internet Protocol security cameras, etc.

A camera controller device 102 may include one or more processors 115, one or more GPUs 116, one or more memory devices 117, and one or more network communication devices 118 in the depicted embodiment. In some embodiments, the memory devices 117 may store video data 119 a, which may be captured by a camera device 101 and transmitted through the network 105. In some embodiments, the memory devices 117 may store executable modules, such as the camera control module 119 b, to control various operational aspects of the camera device 101. In some embodiments, some or all of the modules 110-112 stored on the camera device 101 may be stored in the camera controller device 102. The camera controller device may be configured to control one or more of the cameras, such as by adjusting the pan, tilt, zoom, or combinations thereof.

The user computing device 103 may include one or more processors, one or more GPUs, one or more network communication devices, a display screen, one or more user input devices, and a memory device in the depicted embodiment. The memory device may include various software and/or data, such as an operating system 120, an Internet browser 121 to access a portal provided by the server machine 104, or a video application 122 to view and interact with the data provided by the server machine 104. In various embodiments, the user computing device may comprise, for example, a tablet, a laptop, a desktop computer, a mobile device, a smart phone, a wearable computer, a computer built into a vehicle, etc.

The server machine 104 may comprise one or more physical and/or virtual machines connected to the network 105 in the depicted embodiment. In some embodiments, the server machines 104 may be cloud servers. In some embodiments, one or more server machines 104 may also be configured as a camera controller device 102, and include software such as the camera control module 119 b. In some embodiments, a server machine 104 may include some or all of the modules stored on the camera device 101, such as for example modules 110 to 112. In some embodiments, the server machine or machines 104 may be referred to as the “server system”.

The server system 104 may include one or more processors, one or more GPUs, one or more network communication devices, and one or more memory devices. The memory may store, for example, a video application 123, which may include one or more programmatic interfaces such as a graphical user interface (GUI), command line tools, application programming interfaces (APIs) and the like which may be invoked from user computing devices 103. The memory of the server system 104 may include an autonomous camera attention focus module 124 (e.g. Module V of the modules indicated above), an ROI database module 126 and the ROI database 125 (which in combination may comprise Module IV) in the depicted embodiment. In some embodiments, ROI databases and associated modules may also be referred to as object-of-interest databases and associated modules.

In some embodiments, Modules I, II and III (e.g., 110, 111, and 112) may reside and operate locally on the camera device 101, e.g., in order to reduce bandwidth requirements associated with sending video data to the server system 104 or any other computing device. By executing at least some of the sensor data analysis computations locally on the camera devices in such embodiments, the overall speed of the analysis may be increased, as transit times involved in network transfers may be avoided. In other embodiments the computations of Modules I, II and III may be executed by or at the server system 104, or by other computing devices.

In one embodiment, a camera-equipped device 101 may include the camera itself, a chip to convert the image signal (e.g. CCD signal or CMOS signal) to a frame buffer, followed by a bus connecting it to a microprocessor. The microprocessor may for example include one or more of Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs), Systems on Chip, or mobile GPU cards.

In some embodiments, objects of interest identified in sensor data may be analyzed using deep neural networks in systems such as autonomous or partially autonomous vehicles. The term “autonomous vehicle” may be used broadly herein to refer to vehicles for which at least some motion-related decisions (e.g., whether to accelerate, slow down, change lanes, etc.) may be made, at least at some points in time, without direct input from the vehicle's occupants. In various embodiments, it may be possible for an occupant to override the decisions made by the vehicle's decision making components, or even disable the vehicle's decision making components at least temporarily. Furthermore, in some embodiments, a decision-making component of the vehicle may request or require an occupant to participate in making some decisions under certain conditions. The vehicle may include one or more sensors, one or more sensor controllers, and a vehicle computer. The vehicle may also include a motion control subsystem, which controls a plurality of wheels of the vehicle in contact with a road surface.

In some embodiments, the motion control subsystem may include components such as the braking system, acceleration system, turn controllers and the like. The components may collectively be responsible for causing various types of movement changes (or maintaining the current trajectory) of the vehicle, e.g., in response to directives or commands issued by decision making components. In some embodiments, the decision components may include a motion selector responsible for issuing relatively fine-grained motion control directives to various motion control subsystems, as well as a planner responsible for making motion plans applicable to longer time periods such as several seconds. The rate at which directives are issued to the motion control subsystem may vary in different embodiments. Under some driving conditions (e.g., when a cruise control feature of the vehicle is in use on a straight highway with minimal traffic), directives to change the trajectory may not have to be provided to the motion control subsystems at some points in time. For example, if a decision to maintain the current velocity of the vehicle is reached by the decision-making components, and no new directives are needed to maintain the current velocity, the motion selector may not issue new directives even though it may be capable of providing such directives.

The decision making components may determine the content of the directives to be provided to the motion control subsystem based on several inputs processed by the vehicle computer in different embodiments. In some embodiments, the vehicle computer may implement a sensor data analyzer which includes instances of some or all of the components 110-113, 125, and/or 126. The sensor data analyzer may implement one or more neural networks configured to process sensor data collected from the environment as the vehicle moves. The sensor data analyzer may, for example, receive images representing the operating environment (such as a road) from the sensors at a regular frequency. In some embodiments, each image may be analyzed to extract a plurality of features. Feature indicators may be provided to the decision making components, which may use the feature indicators to issue control directives to the motion control subsystem.

Inputs may be collected at various sampling frequencies from individual video cameras and/or other sensors by the sensor data analyzer in different embodiments. At least some frames of the video may be processed at the neural network(s) of the sensor data analyzer in the depicted embodiment. In some embodiments, the sensor data analyzer may analyze the video frames or other sensor data at a slower frequency than the rate at which the data are being generated. Different cameras and other sensors may be able to update their output at different maximum rates in some embodiments, and as a result the rate at which the output is obtained at the decision making components may also vary from one sensor to another.

A wide variety of sensors may be employed in the depicted embodiment, including for example video or still cameras, radar devices, LIDAR (light detection and ranging) devices and the like. In addition to conventional video and/or still cameras, in some embodiments near-infrared cameras and/or depth cameras may be used. In some embodiments, the sensors may comprise one or more camera devices 101. Different types of sensors may be used in different contexts. For example, while certain image sensors may capture good quality sensor data during high-light scenarios, they may provide very little useful sensor data in low-light scenarios, as the image data may not be able to distinguish objects within the environment. However, other sensors, such as a LiDAR sensor, may have good low-light capabilities. Because different sensors may capture redundant information (e.g., like the image sensor and LiDAR example above), fusion techniques may sometimes be implemented to leverage the strengths of different sensors in different scenarios. Several of these devices may be used to repeatedly generate successive frames in a continuous “video” of the road scene or other aspects of the vehicle environment over a period of time. For example, a LIDAR device may be used to produce a LIDAR video, an infrared camera may be used to produce an infrared video, and so on. In some embodiments, additional sensors may be used to generate videos and/or add information to the captured scene, which may be included in various video frames captured by vehicle cameras. Such additional sensors may include radars, ultrasonic sensors, light beam scanning devices, infrared devices, location sensors (e.g., global positioning satellite (GPS) or Differential GPS (DGPS)), or inertial measurement sensors (e.g., accelerometers, speedometers, odometers, and angular rate sensors, like gyroscopes) in different embodiments. Various ones of these sensors may capture and provide raw sensor data to respective sensor data processing pipelines implemented by the vehicle computer 135 to make perception decisions, such as detecting, classifying, or tracking objects, as discussed in further detail below.

In various embodiments, the vehicle computer may include a number of modules and/or data that may be used to implement a sensor data analysis system on the vehicle. The modules may include for example the ROI object detection module 110, the polygon prediction module 111, the ROI temporal prediction module 112, the ROI database 125, and the ROI database module 126. In some embodiments, the ROI database 125 may shadow a master ROI database maintained on a remote server, and receive periodic updates from the master ROI database. In some embodiments, the vehicle computer may store some amount of video data 113, which may comprise raw or processed video images captured by the sensors.

In some embodiments, the vehicle computer may communicate with a sensor controller to control the operation of the sensors. For example, in some embodiments, the vehicle computer or the sensor controller may implement a camera control module 119 b. The camera control module may operate to control various aspects of the vehicle's sensors, such as for example the pan-tilt-zoom operations of the cameras. In some embodiments, the sensors may include actuators such as actuators 109.

It is noted that although, by way of example, various operations related to sensor processing are described in the context of image frames or video frames, the techniques and algorithms described herein may be applied with equal success to groups of sensor data that may not necessarily include images as such. An image frame may thus be considered just one example of a group or collection of sensor data corresponding to a particular time of data capture. Other examples of such groups may comprise, for example, infrared data, LIDAR data, temperature data and the like. In at least some embodiments, various sets of image data may be analyzed in combination with non-image data; e.g., images captured in low light conditions may be enhanced using infrared data and the like.

FIG. 1B illustrates an example computer system that may be used to implement one or more elements of a sensor data analysis system, according to some embodiments. In at least some embodiments, a system and/or server that implements a portion or all of one or more of the methods and/or techniques described herein, including the techniques to process video images, to execute machine learning algorithms including neural network algorithms, to access remote databases, to control the operations of the cameras, and the like, may be executed on a general-purpose computer system that includes or is configured to access one or more computer-accessible media. FIG. 1B illustrates such a general-purpose computing device 150. In the illustrated embodiment, computing device 150 includes one or more processors 152 coupled to a main memory 154 (which may comprise both non-volatile and volatile memory modules, and may also be referred to as system memory) via an input/output (I/O) interface 156. Computing device 150 further includes a network interface 160 coupled to I/O interface 156, as well as additional I/O devices 158 which may include sensors of various types.

In various embodiments, computing device 150 may be a uniprocessor system including one processor 152, or a multiprocessor system including several processors 152 (e.g., two, four, eight, or another suitable number). Processors 152 may be any suitable processors capable of executing instructions. For example, in various embodiments, processors 152 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 152 may commonly, but not necessarily, implement the same ISA. In some implementations, graphics processing units (GPUs) may be used instead of, or in addition to, conventional processors.

Memory 154 may be configured to store instructions and data accessible by processor(s) 152. In at least some embodiments, the memory 154 may comprise both volatile and non-volatile portions; in other embodiments, only volatile memory may be used. In various embodiments, the volatile portion of system memory 154 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM or any other type of memory. For the non-volatile portion of system memory (which may comprise one or more NVDIMMs, for example), in some embodiments flash-based memory devices, including NAND-flash devices, may be used. In at least some embodiments, the non-volatile portion of the system memory may include a power source, such as a supercapacitor or other power storage device (e.g., a battery). In various embodiments, memristor based resistive random access memory (ReRAM), three-dimensional NAND technologies, Ferroelectric RAM, magnetoresistive RAM (MRAM), or any of various types of phase change memory (PCM) may be used at least for the non-volatile portion of system memory. In the illustrated embodiment, executable program instructions 155 a and data 155 b implementing one or more desired functions, such as those methods, techniques, and data described above, are shown stored within main memory 154.

In one embodiment, I/O interface 156 may be configured to coordinate I/O traffic between processor 152, main memory 154, and various peripheral devices, including network interface 160 or other peripheral interfaces such as various types of persistent and/or volatile storage devices, sensor devices, etc. In some embodiments, I/O interface 156 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., main memory 154) into a format suitable for use by another component (e.g., processor 152). In some embodiments, I/O interface 156 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 156 may be split into two or more separate components. Also, in some embodiments some or all of the functionality of I/O interface 156, such as an interface to memory 154, may be incorporated directly into processor 152.

Network interface 160 may be configured to allow data to be exchanged between computing device 150 and other devices 164 attached to a network or networks 162, such as other computer systems or devices as illustrated in the figures. In various embodiments, network interface 160 may support communication via any suitable wired or wireless general data networks, such as types of Ethernet network, for example. Additionally, network interface 160 may support communication via telecommunications/telephony networks such as analog voice networks or digital fiber communications networks, via storage area networks such as Fibre Channel SANs, or via any other suitable type of network and/or protocol.

In some embodiments, main memory 154 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described herein for implementing embodiments of the corresponding methods and apparatus. However, in other embodiments, program instructions and/or data may be received, sent or stored upon different types of computer-accessible media. Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Generally speaking, a computer-accessible medium may include non-transitory storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD coupled to computing device 150 via I/O interface 156. A non-transitory computer-accessible storage medium may also include any volatile or non-volatile media such as RAM (e.g. SDRAM, DDR SDRAM, RDRAM, SRAM, etc.), ROM, etc., that may be included in some embodiments of computing device 150 as main memory 154 or another type of memory. Further, a computer-accessible medium may include transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link, such as may be implemented via network interface 160. Portions or all of multiple computing devices such as that illustrated in the figure may be used to implement the described functionality in various embodiments; for example, software components running on a variety of different devices and servers may collaborate to provide the functionality. In some embodiments, portions of the described functionality may be implemented using storage devices, network devices, or special-purpose computer systems, in addition to or instead of being implemented using general-purpose computer systems. The term “computing device”, as used herein, refers to at least all these types of devices, and is not limited to these types of devices.

FIG. 1C illustrates an example high level workflow of the processing of video data during analysis, according to some embodiments. As shown, input comprising RGB (red-green-blue) image frames 170 may be fed into a Multilayer Cross-Correlation Deep Neural Network (MCC-DNN) 172. The output of the MCC-DNN may be a set of one or more polygons with a variable number of vertices, as shown in element 174. Each polygon may enclose a region-of-interest (ROI) for further analysis in some embodiments. Velocities of the polygons may then be estimated (element 176), e.g., using computations on the polygon ROIs from the previous frame in time, as shown in element 178. This outputs an estimated velocity or other movement dynamic (e.g. position, acceleration, jerk, etc.) of each ROI. To uniquely identify objects within a scene in a given time period, unique object-level identity features may be extracted and stored into an ROI ID database 125 in the depicted embodiment. The estimated velocity of the trajectory of an object may be fed into an attentional planner module 180 to plan for the optimal trajectory and zoom level to focus attention on the ROI in various embodiments. For example, in some embodiments, a plan generated by the attentional planner module 180 may be used to control the movement of a pan-tilt-zoom camera, such that it zooms into as many ROIs as possible given the physical camera system constraints. The constraints may include, for example, the current state of the camera, which may be transmitted by the attentional controller 182 or from the camera itself. Other camera constraints in various embodiments may include, for example, the rate of movement (e.g. how many degrees per second the camera can pan or tilt, or both), the rate of zoom, the boundaries of movement (e.g. upper and lower bounds of pan or tilt, or both), and/or the boundaries or limits of zoom.

In some embodiments, the attentional controller 182 may communicate with the attentional planner 180 to control the movement of a physical camera. In one embodiment, the attentional controller 182 may comprise one or more camera control devices 102, as discussed in connection with FIG. 1A. In some embodiments, the attentional controller may comprise a sensor controller. For example, an attentional controller 182 may cause the camera, based on a planned trajectory, to switch focus from a first object at a first point in time to a second object at a subsequent point in time. An example of such processing is shown in element 184, which shows a video frame that has been processed to focus on an object of interest 186 using camera panning and zooming.

Operations corresponding to element 172 of FIG. 1C may, for example, be executed by instances or implementations of Modules I and II of the system described earlier in some embodiments. Operations corresponding to element 176 may, for example, be executed by an instance of Module III. The operations of processing and loading data into the ROI database 125 may be executed, for example, by instances of Module IV. The operations of blocks 180 and 182 may be executed, for example, by instances of Module V. Further details regarding each of the modules are provided below with respect to at least some embodiments.

Module I: Region of Interest (ROI) Detection

In some embodiments, Module I for detecting regions of interest, e.g. regions with abnormal activities, or regions with objects of interest, may be a specific instance of a general class of deep neural networks that may be referred to as Multilayer Cross Correlation Deep Neural Network (MCC-DNN). MCC-DNNs may perform two-dimensional (2D) cross correlation operations between a lower layer of the model and a set of features or kernels in various embodiments. This process may be performed recursively in some embodiments, as the output representation of one layer may be used as the input representation of the next-higher layer. In various embodiments, an MCC-DNN may be capable of generating both a probabilistic heat map of regions-of-interest and a set of 2D polygons marking the exact spatial extent of the objects of interest.

FIG. 2 illustrates an example flow of processing for detecting a region of interest in video data, according to some embodiments. The processing may be performed using an instance or implementation of Module I in various embodiments. In some embodiments, the input to the module may comprise a raw RGB image, while the output may be a single-channel gray scale image akin to a heat map. Heat map regions with higher values may indicate regions with a higher probability of having an object-of-interest or abnormal activity present in that region in various embodiments.

As shown in FIG. 2, an input image 201 may be obtained, and various transformation or pre-processing operations 202 may be performed on it to generate a post-processed image 204. Details of the pre-processing are provided below with respect to various embodiments. Filters 203 may be applied to the post-processed image 204 in the depicted embodiment, generating feature maps 206 a. The remaining portion or portions of the filtered image may be cross-correlated 205 with feature maps 206 a or 206 c. The cross correlations may be among the computing operations that are part of the MCC-DNN's functions in various embodiments. Interspersed between the cross-correlation operations, the spatial resolution of portions of the feature maps may be reduced, as indicated by arrows 207. Spatial up-sampling may be used in some embodiments to return the spatial resolution to its original level 208. As output of this phase of sensor data processing, a heat map 209 showing regions of high interest 210 or regions of low interest 212 (or both high-interest and low-interest regions) may be provided in various embodiments. Further details regarding sub-processes of Module I are provided below with respect to various embodiments.

Module I Pre-Processing

In various embodiments the input to Module I may include, for example, decoded RGB images from a frame grabber of a video camera. In some embodiments, each frame may consist of 8-bit-per-pixel, 3-channel RGB image data. Given these color images, the camera device 101 may, for example, reduce the number of color channels from 3 to 2 using a nonlinear preprocessing algorithm which divides the red and green channels by the maximum of all three channels. This may reduce the intensity of the representation. The natural logarithm may then be taken after the division in various embodiments, as indicated in equations (1) and (2).

First Channel: log(R/max(R,G,B))  (1)

Second Channel: log(G/max(R,G,B))  (2)

In various embodiments, this preprocessing may nonlinearly transform the input to the MCC-DNN from 3 to 2 channels, while providing a certain degree of illumination invariance to improve recognition accuracy.
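
A minimal sketch of this two-channel transform, assuming NumPy image arrays; the small epsilon used to avoid taking the logarithm of zero is an implementation assumption, not part of the description above.

import numpy as np

def preprocess_rgb(frame: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Reduce an (h, w, 3) RGB frame to the 2 channels of equations (1) and (2)."""
    rgb = frame.astype(np.float64) + eps
    denom = rgb.max(axis=2)                    # max(R, G, B) at each pixel
    first = np.log(rgb[..., 0] / denom)        # log(R / max(R, G, B))
    second = np.log(rgb[..., 1] / denom)       # log(G / max(R, G, B))
    return np.stack([first, second], axis=2)   # (h, w, 2) result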

Module I Cross-Correlation Operations

In some embodiments, operations performed in the MCC-DNN may comprise the 2D cross-correlation operations between the feature map of the previous layer and a 2D kernel (or 2D filter) of, for example, dimension 3×3. For clarity of presentation, the 1D cross correlation between ƒ and g may be defined as:

$\begin{matrix}{{\left( {f \ast g} \right)\left( t \right)} = {\int_{\tau}{f\left( \tau \right)g\left( {t + \tau} \right)d\tau}}} & (3)\end{matrix}$

This is in contrast to the standard 1D convolution operation, which essentially requires the flipping of kernel g:

$\begin{matrix}{{\left( {f \otimes g} \right)\left( t \right)} = {\int_{\tau}{f\left( \tau \right)g\left( {t - \tau} \right)d\tau}}} & (4)\end{matrix}$

In at least some embodiments, the kernel may be learned from scratch, and there may be no algorithmic reason to prefer the use of standard convolutions. In various embodiments, 2D cross-correlation operations may comprise the main operands used in the MCC-DNNs described herein. This may reduce the complexity of the model and speed up processing time without negatively affecting model performance in at least some embodiments.

In the example of a 2D cross-correlation between an image I(x, y) and a 3×3 kernel K(x′, y′), the operation to compute the cross-correlation response J may be defined as follows with respect to at least some embodiments:

$\begin{matrix}{{J\left( {x,y} \right)} = {\sum\limits_{\overset{\sim}{x} = {- 1}}^{\overset{\sim}{x} = 1}{\sum\limits_{\overset{\sim}{y} = {- 1}}^{\overset{\sim}{y} = 1}{{I\left( {{x + \overset{\sim}{x}},{y + \overset{\sim}{y}}} \right)} \times {K\left( {\overset{\sim}{x},\overset{\sim}{y}} \right)}}}}} & (5)\end{matrix}$
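
For illustration, a direct (and deliberately unoptimized) NumPy version of the 2D cross-correlation of equation (5) might look as follows; the valid-only output size and the function name are assumptions of this sketch rather than details of the MCC-DNN implementation.

import numpy as np

def cross_correlate_2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """2D cross-correlation: slide the kernel over the image without flipping
    it (in contrast to convolution) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    response = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            response[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return response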

Module I—Activation Function

For each layer within the MCC-DNN, following the 2D cross-correlation, in some embodiments a nonlinear activation function may be applied independently for all hidden nodes. One activation function that may be used is the Rectified Linear Unit (ReLU) activation, which may be represented as follows:

ƒ(x)=max(0,x)  (6)

This activation function has two key properties. First, it may result in a sparsification (an increase in sparsity) of the activations, which may in turn lead to better generalization. Second, there may be no plateaus in the values of x which would result in the post-activation function being flat.

In some embodiments, an MCC-DNN used for sensor data analysis may utilize Annealed Rectified Linear Unit (Annealed ReLU) activation. In Annealed ReLU, instead of using a fixed activation throughout the entire training process, which could result in the optimization possibly being stuck in a local minimum, the following time-dependent activation function may be used in at least some embodiments:

$\begin{matrix}{{f_{t}(x)} = {\max\left\{ {{{\max\left( {\frac{T - t}{T},0} \right)}x},x} \right\}}} & (7)\end{matrix}$

In Equation (7), t is the current “time” in the optimization sense (e.g. based on the number of epochs of the training process that have been completed), and T is a hyper-parameter specifying at what point the activation function becomes the standard rectified function of Equation (6).
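
A sketch of equation (7) as an elementwise function on NumPy arrays (the function and argument names are assumptions of this example):

import numpy as np

def annealed_relu(x: np.ndarray, t: int, T: int) -> np.ndarray:
    """Annealed ReLU of equation (7): an identity map at t = 0 whose
    negative-side slope decays linearly, becoming the standard ReLU at t >= T."""
    slope = max((T - t) / T, 0.0)     # max((T - t) / T, 0)
    return np.maximum(slope * x, x)   # equals max(0, x) once the slope reaches 0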

FIG. 3 illustrates aspects of an annealed rectified linear unit which may be utilized for processing sensor data, according to some embodiments. The graphs 301, 311 and 321 depict the behavior of the activation function ƒ(x) for different combinations of t and T as the network is being trained. T may be set to 100 in the depicted embodiment. Initially, at the beginning of training, as seen in graph 301, the activation function behaves like a linear identity transfer function. A given deep neural network with all linear activation functions is known to be convex, thereby making optimization easier. As training progresses from iteration 1 to iteration T, the activation gradually reaches the final shape of the Rectified Linear Unit (ReLU) function shown in graph 321, with ƒ_T(x) being zero for all x<0, after passing through an intermediate stage indicated in graph 311.

FIG. 4 is a flow diagram illustrating aspects of operations which may be performed to train a deep neural network to detect regions of interest using annealed rectified linear activations, according to some embodiments. As shown in element 401, a training system may obtain training images and ground truth polygon labels. As indicated in element 402, the training system may randomly initialize the MCC-DNN, and may for example set T to 10000 and t to 0 with respect to the Annealed ReLU equation (7) above. As training proceeds, the training system may modify the activation functions based on t and T in accordance with equation (7) in the depicted embodiment (element 403). The training system may adapt the parameters of the MCC-DNN by taking one step of stochastic gradient descent, as indicated in element 404. If t is greater than or equal to T, as determined in operations corresponding to element 405, the training may be terminated as indicated in element 407. Otherwise, the training system may loop back to element 403 to perform further modification of the activation function for the next epoch. Prior to looping back, the iteration counter t may be incremented by 1, as indicated in element 406. In this fashion, the training system may train an MCC-DNN comprising Annealed Rectified Linear Units in at least some embodiments.
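
The loop of FIG. 4 might be sketched as follows. The model object and its methods (set_activation_time, forward, backward, sgd_step), the data loader, and the learning rate are hypothetical placeholders used only to show the control flow of elements 401-407.

def train_mcc_dnn(model, data_loader, T: int = 10000, learning_rate: float = 1e-3):
    """Training loop sketch following FIG. 4 (all helpers are hypothetical)."""
    t = 0                                             # element 402: t = 0, T = 10000
    while t < T:                                      # element 405: stop when t >= T
        images, polygon_labels = next(data_loader)    # element 401: images + polygon labels
        model.set_activation_time(t, T)               # element 403: anneal per equation (7)
        loss = model.forward(images, polygon_labels)
        grads = model.backward(loss)
        model.sgd_step(grads, learning_rate)          # element 404: one SGD step
        t += 1                                        # element 406: increment t
    return model                                      # element 407: training terminated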

Module I—Spatial Reduction

In some embodiments, after at least some cross-correlation operations between input feature maps and output feature maps, the spatial resolution of the representation may be reduced by a spatial average operation. The spatial average operation may take, for example, the nearby 3×3 neighborhood in a feature map, and reduce it to a single value by taking the average of the 9 values of the neighborhood. This operation may be performed throughout the locations of each of the feature maps in some embodiments. In effect, the spatial reduction operation 207 shown in FIG. 2 may reduce the spatial resolution of the feature maps 206 by subsampling in such embodiments.
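
A simple NumPy sketch of the 3×3 spatial-average reduction follows; the use of non-overlapping windows (stride 3) and the handling of edge rows are assumptions, as the text does not specify them.

    import numpy as np

    def spatial_average_3x3(feature_map):
        # Average each non-overlapping 3x3 neighborhood into a single value (stride 3 assumed).
        h, w = feature_map.shape
        h3, w3 = (h // 3) * 3, (w // 3) * 3          # drop edge rows/cols that do not fill a window
        blocks = feature_map[:h3, :w3].reshape(h3 // 3, 3, w3 // 3, 3)
        return blocks.mean(axis=(1, 3))              # a 9x9 feature map becomes 3x3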

Module I—Spatial Up-Sampling

In some embodiments, the MCC-DNN may include computing operations that undo the spatial subsampling, to return the spatial resolution of the hidden layers to the resolution level prior to the subsampling. Such operations are indicated by element 208 of FIG. 2. When predicting the Region-Of-Interest, in some embodiments it may be useful to predict regions at a finer scale (e.g. at a pixel level). One way to connect coarse, low-resolution inputs (e.g., a bottleneck layer) to finer resolution regions is via interpolation. For example, in some embodiments, the camera device may use a linear interpolation that computes each output map from the nearest four inputs by a linear map that depends only on the relative positions of the input and output units within the map. This may represent a form of up-sampling or image super resolution in at least some embodiments.
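
One way such interpolation could look is the bilinear up-sampling sketch below, where each output value is a weighted combination of its four nearest inputs; the align-corners style coordinate mapping is an assumption not taken from the text.

    import numpy as np

    def bilinear_upsample(x, factor=2):
        # Each output value is computed from the nearest four input values by a linear map
        # whose weights depend only on the relative positions within the map.
        h, w = x.shape
        ys = np.linspace(0, h - 1, h * factor)
        xs = np.linspace(0, w - 1, w * factor)
        y0 = np.minimum(ys.astype(int), h - 2)
        x0 = np.minimum(xs.astype(int), w - 2)
        wy = (ys - y0)[:, None]
        wx = (xs - x0)[None, :]
        tl = x[np.ix_(y0, x0)];     tr = x[np.ix_(y0, x0 + 1)]
        bl = x[np.ix_(y0 + 1, x0)]; br = x[np.ix_(y0 + 1, x0 + 1)]
        return (1 - wy) * (1 - wx) * tl + (1 - wy) * wx * tr + wy * (1 - wx) * bl + wy * wx * br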

Module I—ROI Heat Map Generation

In some embodiments, a region-of-interest may be represented in the form of a heat map H(x, y) of width w and height h. H(x, y) may consist of values between 0.0 and 1.0 in some implementations, with each value representing the probability of having an object of interest present at the corresponding location. Such a heat map 209 may be produced as the output of Module I in various embodiments as shown in FIG. 2.

Unlike some traditional scanning-window models, the heat map 209 may not be computed serially in at least some embodiments by scanning windows of different sizes, rotations, and scales, an approach which is not easily scalable to large images and real-time video processing. Instead, the heat map 209 may be generated in parallel from the MCC-DNN, where the internal representation has already coded for various features useful for representing regions of interest or regions of high likelihood of interest such as human faces, road objects, or other high-value targets. For example, multiple or all of the pixels of the heat map 209 may be generated in parallel. Such parallelization may help to rapidly identify the regions that are of interest in various embodiments.

Module II: Polygon Prediction

In some embodiments, the output of Module I may be a predicted heat map H(x, y)∈[0.0, 1.0] as discussed above. The higher the value of a cell in the heat map H(x, y), the more likely it may be that something of interest is present at that location in the image in such embodiments.

In various embodiments, a second module, Module II, may use the combination of (a) a predicted heat map 209 and (b) the post-processed image 204 from Module I as input to generate a finite set of convex polygons which surround objects-of-interest. FIG. 5A depicts an example flow of processing for predicting polygons in video data, according to some embodiments. A gated feedback technique may be employed to remove certain areas of intermediate images from the prediction workflow in the depicted embodiment. As shown, the heat map 209 and the post-processed image 204 may be multiplicatively gated 501 to generate a gated image 502. Filters 503 may be applied to the gated image, and the one or more remaining portions may be cross-correlated 504. The feature maps 505, or portions thereof, may undergo spatial reduction 506. The operations may occur several times when generating the feature maps. Recurrent connections 508 in the neural network may be used to output one side of a predicted polygon at a time in some embodiments. The output 509 may comprise multiple predicted polygons in various embodiments.

Module II—Gating

In some embodiments, a first operation of Module II may comprise the gating 501 of the heat map 209 with the post-processed image 204 to produce a gated image 502. The gating may be a multiplicative gating that is implemented by element-wise multiplication of the heat map H(x, y) and the image I(x, y) at all locations in at least one embodiment. Gating techniques which are not multiplicative may be employed in one embodiment. In some embodiments, gating may remove areas of input images which are not interesting, for example, background or static structures in the scene.
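
The multiplicative gating described here reduces to an element-wise product; a minimal sketch (the function name is illustrative) is shown below.

    import numpy as np

    def gate_image(post_processed_image, heat_map):
        # Element-wise (multiplicative) gating of the post-processed image I(x, y) by the
        # heat map H(x, y); locations where H is near zero are suppressed in the gated image.
        assert post_processed_image.shape == heat_map.shape
        return post_processed_image * heat_map

    # Example usage: gated = gate_image(I, H), corresponding to gated image 502 in FIG. 5A.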

In some embodiments, gated feedback of the kind illustrated in FIG. 5A may help to focus the capabilities of the polygon detection system on only regions of interest, making it computationally efficient. Furthermore, in at least some embodiments, such gated feedback may perform a version of non-maximal suppression or lateral inhibition so that there is a one-to-one match between a predicted convex polygon and an object-of-interest (e.g. a face, a road object, etc.). An MCC-DNN used in Module II may be similar to an MCC-DNN used in Module I in some embodiments. However, the two MCC-DNNs may not necessarily use the same set of parameters in at least some embodiments.

Module II—Recurrent Connections

In some embodiments, the sensor data analysis system may be capable of predicting convex polygons with a varying number of vertices and edges. FIG. 5B depicts an example flow of processing for using recurrent neural networks to detect polygons, according to some embodiments. In some embodiments, a recurrent neural network (which may be termed “PolygonRNN” herein) which takes the output of the gating MCC-DNN features as input may be used. PolygonRNN may be configured to generate a varying but finite number of polygons per object or location. Within each hidden layer 510 (e.g., 510 a, 510 b, . . . , 510 n), PolygonRNN may use one or more Long Short Term Memory (LSTM) units in at least some embodiments. Recurrent connections 508 (e.g., 508 a, 508 b, etc.) may link various hidden layers in the depicted embodiment.

FIG. 6 is a schematic diagram of a Long-Short Term Memory unit that may be used in polygon generation, according to some embodiments. An example LSTM unit 600 is shown in isolation in FIG. 6. Each LSTM unit may have a cell 610 which has a state c_(t) at time t. The cell may function as a memory unit. Access to this memory unit for reading or modifying may be controlled through sigmoidal gates, for example, input gate i_(t) 620, forget gate ƒ_(t) 630 and output gate o_(t) 640. In some embodiments, the LSTM unit may operate as follows. At each time step it may receive inputs from two external sources at each of the four terminals (the three gates and the input). The first source may comprise the current frame x_(t). The second source may comprise the previous hidden states of all LSTM units in the same layer, h_(t-1). Additionally, each gate may have an internal source, the cell state c_(t-1) of its cell block. The inputs coming from different sources may be added up, and a bias may be applied. The gates may be activated by passing their total input through the logistic function. The total input at the input terminal may be passed through the tanh non-linearity. The resulting activation may be multiplied by the activation of the input gate. This may then be added to the cell state after multiplying the cell state by the forget gate's activation ƒ_(t). The final output from the LSTM unit, h_(t), may be computed by multiplying the output gate's activation o_(t) with the updated cell state passed through a tanh non-linearity. The following exemplary update equations may summarize the operations of a layer of LSTM units according to at least some embodiments:

i_(t)=σ(W_(xi) x_(t)+W_(hi) h_(t-1)+W_(ci) c_(t-1)+b_(i))  (8)

ƒ_(t)=σ(W_(xf) x_(t)+W_(hf) h_(t-1)+W_(cf) c_(t-1)+b_(f))  (9)

c_(t)=ƒ_(t) c_(t-1)+i_(t) tanh(W_(xc) x_(t)+W_(hc) h_(t-1)+b_(c))  (10)

o_(t)=σ(W_(xo) x_(t)+W_(ho) h_(t-1)+W_(co) c_(t-1)+b_(o))  (11)

h_(t)=o_(t) tanh(c_(t))  (12)

The cell-to-gate weight matrices W_(ci), W_(cf) and W_(co) referenced in equations (8)-(11) may be diagonal in the depicted embodiment, whereas the rest of the matrices may be dense. tanh is the hyperbolic tangent function. Note that the cell state c_(t) in an LSTM unit may sum activities over time. Since derivatives distribute over sums, the error derivatives do not vanish quickly as they get sent back in time. This makes it easy to do credit assignment over long sequences to discover long-range features using LSTMs in various embodiments.
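
For reference, a single LSTM step following equations (8)-(12) might be written as below; the dictionary-of-parameters layout and function names are assumptions made for the sketch only.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        # One step of the update of equations (8)-(12). p holds the weight matrices and biases;
        # the cell-to-gate matrices W_ci, W_cf, W_co are assumed diagonal, the rest dense.
        i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])   # (8)
        f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev + p["b_f"])   # (9)
        c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])   # (10)
        o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_prev + p["b_o"])   # (11)
        h_t = o_t * np.tanh(c_t)                                                              # (12)
        return h_t, c_t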

In some embodiments, after taking the gated image 502 and using the MCC-DNN to determine the activation of the layer just before the recurrent connections 508 in FIG. 5A, the recurrent connections 508 a-m in FIG. 5B allow the model to output one edge 516 of the polygon at a time until a “stop” token is produced. In some embodiments, the weights (or parameters) of the recurrent connections 508 a-m may be tied (and equal) for all time steps. FIG. 5C is a flow diagram illustrating aspects of operations of a recurrent neural network used for polygon generation, according to some embodiments. The corresponding pseudo-code is provided in pseudo-code section 1.

As shown in element 520 of FIG. 5C, input for polygon generation may be obtained from the MCC-DNN activation in the depicted embodiment. Two real-valued numbers may be generated representing the centroid of a polygon (element 521), and a parameter representing a current angle may be set to 0. If a STOP symbol is generated, as detected in operations corresponding to element 522, the operations of the recurrent neural network may be terminated (element 526), the vertices making up the polygon may be collected, and output comprising the location and sides of a convex polygon may be generated (elements 527 and 528).

Pseudo-code section 1: polygon generation

INPUT (element 520 of FIG. 5C): MCC-DNN layer activation before the recurrent connections
OUTPUT (element 528): Location and sides of a convex polygon

Step 1 (element 521): Generate two real-valued numbers for the centroid of the polygon: x, y.
  Set the current angle to be 0 radians.
Step 2:
  IF a stop symbol is generated (element 522): Stop the RNN (element 526), and GOTO Step 5 (element 527).
  ELSE:
    Step 3 (element 523): Generate two real-valued numbers: change in angle and radius from the centroid.
      The change in angle is the change of angle from the previous angle in the counter-clockwise direction. Radius from centroid is the distance from the centroid of the polygon.
      The two values generated must satisfy the convex polygon property. If the radius is too big or too small for convexity, then the value will be projected to the closest value in the feasible set.
      Given a generated change in angle and radius, a new vertex is obtained. GOTO Step 5 when the feasible set is the null set (element 529).
    Step 4 (element 524): The line connecting the previous vertex to the current vertex is one edge of the polygon.
      Update the angle and repeat Step 2 (element 525).
Step 5 (element 527): The collection of the vertices makes up the convex polygon.

If a STOP symbol is not generated, two real-valued numbers for a current vertex may be generated (element 523). It may be the case that a feasible set for the new vertex is a null set (as detected in element 529), in which case the vertices that make up the polygon may be collected (element 527) and the output may be provided. If the feasible set is not null, an edge of the polygon may be generated by connecting the current vertex with a previous vertex (element 524) in the depicted embodiment. The angle may be updated and the next vertex may be generated (element 523).
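
A structural sketch of this decoding loop follows. The `rnn` object and its methods (init_state, centroid, step) are hypothetical placeholders standing in for the trained PolygonRNN, and the sketch assumes the emitted (change-in-angle, radius) pairs have already been projected onto the feasible convex set.

    import numpy as np

    def generate_polygon(rnn, mcc_dnn_activation, max_vertices=32):
        # Illustrative loop mirroring pseudo-code section 1; not the original implementation.
        state = rnn.init_state(mcc_dnn_activation)
        cx, cy = rnn.centroid(state)                 # Step 1: centroid of the polygon
        angle, vertices = 0.0, []
        for _ in range(max_vertices):
            out, state = rnn.step(state)
            if out is None:                          # Step 2: stop symbol generated
                break
            d_angle, radius = out                    # Step 3: counter-clockwise angle change, radius
            angle += d_angle
            vertices.append((cx + radius * np.cos(angle),
                             cy + radius * np.sin(angle)))   # Step 4: new vertex forms the next edge
        return (cx, cy), vertices                    # Step 5: vertices make up the convex polygon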

Module III: Polygon ROI Temporal Prediction and Tracking

FIG. 7 illustrates an example flow of processing for polygon temporal tracking and future prediction, according to some embodiments. In some embodiments, given the MCC-DNN detections of polygons at each time step from Module II, a statistical auto-regressive model may be formulated to model the motion and dynamics of polygons. In one implementation, such a model may operate at an example temporal frequency of 2 Hz (taking input at 500 millisecond intervals) and make a Markov assumption. At every time step, the system may receive the current data of a polygon 701, including the location of the polygon and its vertices, and provide the data as input into a model 702. In some embodiments, the model structure may be similar to that of the recurrent neural network described above for polygon detection, but with different input representations. By observing the polygon for one or more past time steps 703 (e.g., t−3, t−2, etc.) using the recurrence connections, Module III may forward simulate the future position trajectory and size changes of each polygon in the scene. For example, the predicted future position, velocity, and more generally the movement of the polygons may be provided as output 704 with respect to one or more future frames in the depicted embodiment. This process may be performed for all detected polygons in various embodiments.

In some embodiments, with reference to FIG. 7, the vertices of the polygon may undergo a linear transformation with a weight matrix followed by a nonlinear activation, for example, using ReLU activation. The resulting vector may comprise a hidden vector. The hidden vector at time t may be multiplied with another weight matrix to generate the hidden vector at time t+1. During time steps 703 in the depicted scenario, the polygons corresponding to t−3 to t may be provided by the MCC-DNN. Starting in time steps 704, the polygon may not be provided by the MCC-DNN, but rather may be generated by the hidden vectors at time t+1 and time t+2, respectively. In some embodiments, the computations to output these future predictions may be similar to the computation described in connection with FIG. 5B above. In some embodiments, to compute the polygon at time t+2, the generated polygon from time t+1 may be treated by the system as an additional input from the past. For example, in computing the polygon at time t+2, the “now” boundary between 703 and 704 may be shifted to the right.
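
A minimal sketch of this recurrence is given below. The weight matrix names, the readout layer, and the way the hidden state is advanced are assumptions made for illustration; only the overall pattern (observed polygons feed the hidden vector, then predictions are fed back as past inputs) follows the description above.

    import numpy as np

    def predict_future_polygons(past_polygons, W_in, W_rec, W_out, n_future=2):
        # Illustrative auto-regressive rollout: vertices -> linear transform + ReLU -> hidden vector,
        # which is advanced in time by another weight matrix; predictions are fed back as inputs.
        h = np.zeros(W_rec.shape[0])
        for poly in past_polygons:                       # polygons at t-3 ... t supplied by the MCC-DNN
            h = np.maximum(0.0, W_in @ np.ravel(poly))   # hidden vector from observed vertices
            h = W_rec @ h                                # advance the hidden vector one time step
        preds = []
        for _ in range(n_future):                        # t+1, t+2, ...: no MCC-DNN input available
            poly = W_out @ h                             # predicted polygon from the hidden vector
            preds.append(poly)
            h = W_rec @ np.maximum(0.0, W_in @ poly)     # treat the prediction as an additional past input
        return preds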

Module IV: Object-of-Interest Identity Database

FIG. 8 illustrates an example flow of processing for storing and retrieving data to and from a region of interest (ROI) identity database, according to some embodiments. In one embodiment, such processing may be performed by a separate ROI database module (e.g., module 126 of FIG. 1A). In some embodiments, the system may assign a unique identifier to each region-of-interest. For example, a feature vector representing a face or an object identified in an environment of a vehicle may be associated with or used as a unique identifier in various embodiments. Such unique identifiers may of course differ for different objects, while being as similar as possible for the same object across time in some embodiments. To convert a polygon region-of-interest (ROI) into a unique feature vector, the computing system may first automatically generate a rectangle 802 which encloses the polygon 801 in the depicted embodiment. This rectangle may be used to crop the image. The cropped image 803 may be provided as input into another, separate MCC-DNN 804. In some embodiments, the cropped image may have, for example, an input size of 128 by 128. The MCC-DNN 804 may be trained in an unsupervised manner to reproduce its input, in an encoding-decoding architecture. The activation of the middle layer (e.g. 64 dimensions) of the MCC-DNN 804 may comprise a feature vector 805 that can be used to represent the ROIs.

In some embodiments, a multi-dimensional feature vector 805 may comprise a semantic code that is real-valued and can be stored into the database 125 and later be retrieved to compare against other codes. The codes may be assigned such that codes of the same object/person from different video frames in time are nearby in a Euclidean code space (e.g., a 64-dimensional space), whereas the codes for polygons representing different objects have much higher Euclidean distance. In various embodiments, entries may be extracted from the database 125 to reproduce the desired polygon 801 and the image data therein; that is, the database contents may be used to uniquely recreate the image or other sensor data of interest.
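
Comparing a new semantic code against stored codes by Euclidean distance can be sketched as follows; the threshold value and function name are illustrative assumptions, not values from the original disclosure.

    import numpy as np

    def match_semantic_code(code, database_codes, threshold=0.5):
        # Return the index of the nearest stored 64-dimensional code if it is within the
        # (illustrative) distance threshold, otherwise None (i.e., treat as a new object).
        if len(database_codes) == 0:
            return None
        dists = np.linalg.norm(np.asarray(database_codes) - np.asarray(code), axis=1)
        best = int(np.argmin(dists))
        return best if dists[best] < threshold else None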

Module V: High Resolution Attention Focus

As discussed above, modules I, II, and III of the sensor data analysis system may operate to provide detection of the regions-of-interest in the form of the convex polygons in various embodiments. These modules may also provide an estimate of the future location and size changes of each detected polygon in the sensor data. In at least some embodiments, Module IV may provide a way to encode any given polygon into a multi-dimensional semantic code (e.g. a feature vector with 64 dimensions), based on the image appearance of the polygon as discussed above.

In some embodiments, Module V may interact with various control systems to focus on one or more objects in the video data (or other sensor data), e.g., based on the identified polygons and predicted future movements. In various embodiments, Module V may comprise a high-resolution attention focus module 124 (shown in FIG. 1A), which controls the pan, tilt, and zoom ability of a camera system 101 so that various polygon ROIs are zoomed into and high-resolution frames are captured for future analysis or immediate action.

In some embodiments, Module V may comprise a controller on an autonomous vehicle. The controller may be, for example, a sensor controller that controls the pan, tilt, and zoom ability of a sensor. In some embodiments, the controller may be a motion selector or other decision component, which controls the motion of the vehicle. For example, the motion selector may take a stream of video frames of a road scene, and intelligently track various objects-of-interest as they move relative to the vehicle. In one embodiment, for example, a traffic light detected in a video stream may be focused on as the vehicle moves. If the sensor data analysis system on the vehicle detects a change in the state of the traffic light, the decision component may send different motion directives to the motion control subsystem in such an embodiment.

In various embodiments, given the output of Modules III and IV, Module V may optimize a schedule, e.g., using the optical zoom and pan-tilt feature of a camera or the capabilities of various sensor controllers, to sequentially (e.g. serially) focus on multiple objects-of-interest (polygons) over a sequence of frames.

FIGS. 9A and 9B illustrate examples of scheduling the movements of a camera based on detected objects and object velocities, according to some embodiments. Initially, in FIG. 9A, the camera may be centered. Modules I and II may detect two persons of interest, A and B, in the first image 910. The scheduler may take as its input the two polygons that are detected in the scene along with their estimated velocity vectors from Module III, and may use these to estimate how fast each person is moving. The system may then find the best sequence of movements for the camera to zoom into or focus on the objects-of-interest in the scene. In the depicted example scenario of FIG. 9B, as indicated by camera schedule 920, the camera may first zoom onto person B, before attending to person A.

FIG. 10A illustrates an example flow of processing for determining attention focus of one or more cameras, according to some embodiments. As shown in FIG. 10A, an attention focus module 1077, which may be implemented on one or more computing devices such as the servers shown in FIG. 1A or a sensor controller, may comprise at least two components in the depicted embodiment: an attention planner 1001 and an attention controller 1002. The attention planner 1001 may, for example, determine the optimal order in which to traverse a set of objects of interest at a selected frequency. In some embodiments, new plans for object traversal may be computed at regular intervals, for example every second. Given one or more target objects to zoom into or focus on, the attention controller 1002 may compute the motor or actuator commands to output at another selected frequency. In one implementation, for example, traversal plans may be generated by the attention planner every N milliseconds, while actuator/motor commands may be issued every P milliseconds, where P is less than N.

As illustrated, input 1003 to the attention focus module may be in the form of detected polygons along with estimated future position trajectories in various embodiments. Input 1003 may for example comprise a list of objects of interest, and their current and future positions and sizes 1004, which may be provided by an MCC-DNN as described above. In some embodiments, camera sensors (and/or other types of sensors) may provide sensor measurements 1005 as additional input to the attention focus module, including for example a camera's current position and angle. In the depicted embodiment, the attention controller may include, for example, a proportional-integral (PI) controller subcomponent 1008, as well as a subcomponent 1009 for mapping angular velocity to motor speeds. In some embodiments, the attention planner may store information about objects of interest in a set of internal object queues 1006, and a tree search algorithm similar to the A* algorithm may be used to perform a greedy search 1007 to identify the next object 1090 on which focus or zoom should be directed. The attentional planner 1001 may interact with the controller 1002 to compute the motor or actuator speed commands 1089, which may be provided to cameras or other sensors.

Module V—Attentional Planner

In some embodiments, a goal of the attention planner 1001 may comprise centering and/or focusing a camera onto every polygon before the polygon leaves a scene. In one embodiment, the problem may be formulated as a graph-based search problem, where the nodes in the graph represent the state of the environment being analyzed. FIG. 10B is a flow diagram illustrating aspects of operations which may be performed with respect to attentional planning for a camera, according to some embodiments. The corresponding pseudo-code is provided in pseudo-code section 2. In the embodiment depicted in FIG. 10B, a hybrid combination of a greedy search with depth of 2 and an A* informed search may be employed.

As indicated in element 1010, input comprising a list of objects of interest and their respective predicted positions/locations may be obtained in the depicted embodiment at the attention planner. A root node of a tree with the current camera parameters (such as an angle of orientation of the camera, optical zoom settings and the like) may be created (element 1011). The list of objects may be pruned (element 1012), e.g., by matching/comparing their semantic codes to those in an object-of-interest identity database in some embodiments. Objects with semantic codes which differ by less than a selected threshold ϵ from previously-examined objects may be considered to have already been visited in some embodiments, and may be removed. A child node may be created for each of the objects which remain after pruning (element 1013). For each root-to-child edge, the cost of making a corresponding move may be computed or estimated in at least some embodiments (element 1014). In some embodiments, the cost of an edge may be based on the time it takes to move from the current camera position to the location of the next unvisited polygon. If such a move is infeasible due to the velocity of the object and/or the angular velocity of the camera, the cost may be set to infinity (or some very high value) in various embodiments.

Pseudo-code section 2: attentional planner

INPUT (element 1010 of FIG. 10B): List of objects-of-interest in the scene and their predicted location and size in the future (e.g. 1 to 2 seconds)
OUTPUT (element 1019): A priority queue for zooming into objects-of-interest.
GRAPH NODES: State space for each graph node consists of the current camera parameters and a list of unvisited polygons in the scene

Step 1 (element 1011): Create a root node with camera parameters (e.g., view angles, optical zoom parameters).
Step 2 (element 1012): Prune the list of objects by matching the semantic codes to those in the Object-of-Interest Identity Database. Codes which are less than a threshold of ϵ from existing entries in the gallery have been visited and can be removed.
Step 3 (element 1013): Using the provided list of objects-of-interest, create one child node for each object.
Step 4 (element 1014): For each root-to-child edge, compute the cost of making such a move. If the move is impossible, the cost is infinite.
Step 5 (element 1015): For each child node, construct child nodes for all other objects-of-interest in the list.
Step 6 (element 1016): Compute the edge costs for reaching all the second-level child nodes. This has complexity O(N²), where N is the number of objects in the scene.
Step 7 (element 1017): For every object that has been predicted to have exited the scene at the current time of the second-level child node, add a heuristic cost C to the edge. This heuristic is admissible as it is never an overestimate of the actual cost.
Step 8 (element 1018): Greedily select the path with the lowest depth-2 cost. Add the two nodes to the priority queue in order of planned traversals.

With respect to each child node, additional child nodes corresponding to one or more other objects-of-interest in the list may be constructed (element 1015) in the depicted embodiment. The edge costs for reaching the second-level child nodes may be computed (element 1016). This has complexity O(N²), where N is the number of objects being considered. For every object that has been predicted to have exited the scene at the current time of the second-level child node, a heuristic cost C may be added to the edge cost in at least some embodiments (element 1017). A path with the lowest depth-2 cost may be selected (element 1018) and the two nodes corresponding to that path may be added to the priority queue in order of planned traversal in various embodiments. The priority queue may be provided as output (element 1019).
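
For illustration, Steps 3-8 of the planner can be condensed into the following greedy depth-2 sketch, assuming at least two objects remain after pruning. `move_cost(state, obj)` and `obj.exited_by(t)` are hypothetical helpers (returning the cost and resulting camera state of a move, and whether the object is predicted to have left the scene by time t); the heuristic value is a placeholder.

    from itertools import permutations

    def plan_depth2(camera_state, objects, move_cost, heuristic_cost_c=1000.0):
        # Greedy depth-2 planning sketch mirroring pseudo-code section 2; not the original code.
        best_pair, best_cost = None, float("inf")
        for a, b in permutations(objects, 2):          # first- and second-level child nodes (Steps 3-6)
            cost_a, state_a = move_cost(camera_state, a)
            cost_b, _ = move_cost(state_a, b)
            total = cost_a + cost_b
            if b.exited_by(total):                     # Step 7: add heuristic cost C for exited objects
                total += heuristic_cost_c
            if total < best_cost:
                best_pair, best_cost = (a, b), total
        return list(best_pair) if best_pair else []    # Step 8: traversal order for the priority queue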

Module V—Attentional Controller

Pseudo-code section 3: attentional controller

INPUT (element 1030): State of the camera parameters, current and predicted location of the object-of-interest
OUTPUT (element 1037): Motor speed and optical focus commands

Set variable Total Error E ← 0
WHILE zoom criterion not met DO (block 1031)
  Step 1 (element 1032): Computed error e is set as the distance between the desired camera angle and the current camera angle.
  Step 2 (element 1033): Add current error to Total Error: E ← E + e
  Step 3 (element 1034): The angular speed is computed as: u(t) ← K1 * e + K2 * E, where K1 and K2 are constants.
  Step 4 (element 1035): A one-to-one mapping between camera angular speed and the speed control signal for each of the motors is used for sending motor speed signals to the camera controller(s) or microcontroller(s).
END WHILE
Step 5 (element 1036): Remove the object from the list of nodes, and invoke the attentional planner for zooming into a new object.

In some embodiments, an attention controller 1002 may be responsible for generating a series of motor commands or actuator commands to one or more sensors based on the output of the attention planner. FIG. 10C is a flow diagram illustrating aspects of operations for controlling an attention-focusing camera, according to some embodiments. The corresponding pseudo-code is shown in pseudo-code section 3. The input (as indicated in element 1030) may for example comprise the state of a camera and the current and predicted positions of some number of objects of interest, while the output may comprise a set of motor or actuator commands, including for example zoom commands. In an initialization step, the total error may be set to zero. As indicated in element 1031, one or more iterations of a while loop may be performed in the depicted embodiment until a zoom criterion is met. The zoom criterion may differ depending on the embodiment. For example, in some embodiments, when the object of interest occupies 256×256 pixels in the image space of the camera with the help of the optical zoom, the criterion for zooming may be deemed to be satisfied. In other embodiments, if the long side (e.g. either the height or width) of the object of interest becomes 256 pixels, the zoom criterion may be deemed to have been met. In other embodiments, different target image sizes may be used other than 256×256 pixels. In some embodiments, a proportional-integral (PI) controller 1008 may be used for determining motor/actuator movements. Other types of controllers may be used for the actuators or motors in various embodiments.

As shown in element 1032, within an iteration of the while loop, a computed error e may be set as the distance between the desired camera angle and the current camera angle in the depicted embodiment. The current error may be added to the total error (element 1033), and the angular speed may be computed (element 1034). A one-to-one mapping between camera angular speed and the speed control signal for each of the motors may be generated and used for sending motor speed signals to the camera controller(s) or microcontroller(s) (element 1035). The next iteration of the while loop may then be initiated unless the zoom criterion has been met. After the zoom criterion is met, the object which was the target of the zoom may be removed from the input list (element 1036), and the attentional planner may be invoked to provide the next object of interest. The motor commands may be transmitted to the targeted controllers or microcontrollers (element 1037) in the depicted embodiment.
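
The PI-style loop of pseudo-code section 3 might be sketched as below; the `camera` object with its methods, the gain values K1 and K2, and the zoom-criterion check are all hypothetical stand-ins for illustration.

    def attention_control_loop(camera, desired_angle, k1=0.5, k2=0.01):
        # Illustrative PI control loop (Steps 1-4 of pseudo-code section 3); `camera` is a
        # hypothetical object exposing zoom_criterion_met(), current_angle() and set_motor_speed().
        total_error = 0.0                                # E <- 0
        while not camera.zoom_criterion_met():           # e.g., object fills 256x256 pixels (block 1031)
            e = desired_angle - camera.current_angle()   # Step 1: error between desired and current angle
            total_error += e                             # Step 2: E <- E + e
            u = k1 * e + k2 * total_error                # Step 3: angular speed u(t) = K1*e + K2*E
            camera.set_motor_speed(u)                    # Step 4: map angular speed to motor commands
        # Step 5 (element 1036): caller removes the object and invokes the attention planner again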

FIG. 11 illustrates a search tree that may be used to plan control actions for an attention-focusing camera, according to some embodiments. Each node in the search tree or graph may represent a state in the depicted embodiment, including for example the camera orientation or viewing angle, elevation, and zoom information. A node may also contain the order of the polygons of interest which have been visited so far in at least some embodiments. Each level of the search tree path may, for example, represent a completion of a zooming task that either ends in success or failure. The edges may represent the cost that would be incurred by moving from the parent node to the child node associated with the edge. For example, the cost may be high for long pans and tilts, while the cost may be low for fast small adjustments of the camera to nearby polygons. In the depicted embodiment, a camera's initial position is indicated by the root node 1110. A path comprising nodes 1114 and 1116 may represent a plan in which first an object B is zoomed on, followed by object A, while a path comprising nodes 1118 and 1120 may represent a plan in which A is zoomed on before B.

The planning task may involve solving an optimization problem to find an order of polygons to traverse so as to minimize a cost function in various embodiments. The cost function may take into account the time it takes to physically pan, tilt, and zoom onto the polygon object-of-interest. In some embodiments, the physical constraints of the camera, e.g. whether or not it is at all possible to physically make the move, may be considered as well. In the case that a move is physically impossible, the edge cost may be set to a very high value representing infinity in some embodiments. The planning of camera motor state changes may represent an example of a model-based reinforcement learning problem with fully observed states in at least some embodiments.

FIG. 12 illustrates an example computer system of a movable device (e.g. an autonomous vehicle) at which sensor data may be analyzed using neural networks, according to some embodiments. As shown, computer 1200 may be configured with a plurality of modules. For example, the computer 1200 may include an object tracker 1202. Object tracker 1202 may include, for example, the MCC-DNN discussed in previous figures. In some embodiments, the road object tracker 1202 may implement all or some combination of Modules I to IV discussed previously.

In some embodiments, the object tracker 1202 may detect objects on the road, continuously generate polygons for these objects, and continuously track and predict the movements of these objects relative to the vehicle, which may itself be moving. In one embodiment, the road object tracker may perform a classification task, which determines the general type of each object, which may be associated with an object type indicated in the ROI or object-of-interest database. Examples of road object types may include, among others, a road (i.e. drivable region), lanes and lane markers, other vehicles, pedestrians, road obstructions, traffic signals, road signs, buildings and landmarks, etc. Such data may be provided as input to various object analysis subsystems 1210 of the vehicle as data objects, e.g., via an API in the depicted embodiment. In some embodiments, the object analysis subsystems 1210 may simply be portions of the same neural network(s) that implement the object tracker 1202. In some embodiments, the object analysis subsystems 1210 may be downstream neural networks that are separate from the road object tracker network, and may only receive a small portion of a video frame or other groups of sensor data as input. As shown, the object analysis subsystems 1210 may include a variety of specialized subsystems or subnetworks 1212, 1214, 1216, and 1218. Such specialized subsystems or subnetworks may include a vehicle analyzer, a traffic light analyzer, a pedestrian analyzer, and a road sign analyzer. Each of these analyzers may perform specialized tasks corresponding to objects of a given type in some embodiments. For example, the traffic light analyzer may be programmed or trained to determine whether a traffic light is red, green, or yellow. A vehicle analyzer may be programmed or trained to detect and predict the speed and direction of a vehicle relative to the vehicle camera, recognize the various signals on the vehicle such as brake lights and turn signals, and decipher the license plate of the vehicle.

In some embodiments, these object analysis subsystems 1210 may provide output to the decision making component 1220, which in turn may provide motion directives to the motion control subsystem 1230 of the movable device. In some embodiments, the object analysis subsystems 1210 may be part of the decision making components 1220. In some embodiments, the decision making component 1220 may itself be a neural network, which may be a subnetwork of the overall network that implements Modules I to IV discussed previously.

As shown, the vehicle may include one or more sensors 1240, which in some embodiments may be movable to pan, tilt, or zoom based on commands from a sensor controller 1250. In some embodiments, the sensor controller 1250 may receive control commands from the object tracker 1202, which may implement an attentional planner as discussed previously. For example, in some embodiments, the object tracker 1202 may detect an accident on the side of the road, and move the vehicle's sensor to focus on the accident as the vehicle moves past the scene.

In some embodiments, the vehicle may include one or more vehicle displays, which may be associated with display controllers. The display controllers may be capable of controlling various aspects of the display, including for example whether the display is active, sound controls on the display, a selection of source data for the display, the zoom level of the display, etc. In some embodiments, the display controller may be able to switch the source of the display from a first vehicle sensor to a second vehicle sensor, etc. In some embodiments, the display controller may use data generated from the object tracker 1202 as control input. For example, in some embodiments, the vehicle may be monitoring video input from both a front camera and a back camera. In one example, when the back camera detects an object (e.g. a police car or ambulance in the rear view), the display controller may receive this information from the object tracker 1202, which may cause the display controller to switch an in-cabin vehicle display to show the rear view as seen by the back camera. In some cases, the display controller may magnify the object of interest in the video (e.g., the police car or fire truck), so that the object is displayed more prominently on the vehicle display. The display controller may make the switching decision by itself, or in some embodiments, receive control commands from the vehicle's computer, based on the images provided by the vehicle's back camera. In some embodiments, the display controller may be implemented at least in part as a subnetwork of the MCC-DNN used to implement the object tracker 1202.

In some embodiments, the vehicle's sensors may be operational even when the vehicle is parked or turned off. For example, the vehicle's sensors 1240 may act as a vehicle alarm system by monitoring the vehicle's surroundings when it is parked. As another example, the sensors 1240 may act as surveillance cameras. For example, a vehicle may be parked in front of a house at night and perform periodic surveillance of the house's surroundings. In some embodiments, the vehicle's sensors may use the object tracker 1202 or pedestrian analyzer to detect and track people walking around the house. If suspicious activity is determined, the vehicle may sound an alarm, or in some embodiments send a message to the vehicle's owner via an email or text.

In some embodiments, the decision making component 1220 may be configured to implement an attention focusing system of a movable device, such as an autonomous vehicle.

For example, an autonomous vehicle V may be traveling on a road. The vehicle V may be using its object tracker to track a number of objects in its vicinity. For example, vehicle V may be tracking a second vehicle V2 in front, and a third vehicle V3 in a lane to the right. In addition, the vehicle V may be tracking road signs, such as an exit sign along the road. The objects may be tracked such that their movements relative to the vehicle V are monitored. In some embodiments, each object may be identified and associated with a known object from the object database and assigned an enclosing polygon over successive video frames from a vehicle sensor, and the movements of the polygon may be tracked from frame to frame. In some embodiments, the object tracker of the vehicle V may account for the movements of vehicle V, so as to isolate the movements of the other vehicles on the road from those of vehicle V itself. In this manner, the vehicle V may accurately predict the positions of the other vehicles V2 and V3 on the road, given their current observed movements.

In some embodiments, vehicle V may in effect perceive, using its object tracking system, that vehicle V2 is moving slightly faster and veering slightly to the right. In one example scenario, vehicle V may be able to predict, based on such information, that in two seconds, vehicle V2 may be in a new position on the road farther ahead of V. Similarly, vehicle V may perceive that vehicle V3 is slower relative to the vehicle V and moving relatively quickly to the right. Based on the captured information, vehicle V may be able to deduce that vehicle V3 is a truck making a lane change to the right. Accordingly, vehicle V may predict that in two seconds, vehicle V3 may be in a new position on the road farther to the right of V. In addition, the object tracker may observe a road sign, and determine that in two seconds the sign will be in a new position farther behind V. In some embodiments, the vehicle V may determine that the road sign is a stationary object, so that its position may be predicted based on the vehicle's own movements alone. In that case, the vehicle V may simply determine the position of the sign within its model of the environment based on its own movements.

In some embodiments, as discussed in connection with FIG. 12, various objects identified in the vehicle's operating environment may be further analyzed. For example, the vehicles V2 and V3 may be closely analyzed by a special module or neural network to observe the vehicles' brake lights and turn signal lights. The road sign may be closely analyzed by a special module or neural network to determine the contents of the sign in some embodiments. In this example, the vehicle V may determine that the road sign is an exit sign indicating an upcoming exit from the road.

In response to determining the exit sign, the vehicle V may determine to change lanes to make an exit from the road. The vehicle V may use another embodiment of Module V to establish a plan for a best sequence of moves to reach the rightmost lane. For example, the vehicle V may determine that the best sequence of moves is to first accelerate and change to the middle lane in front of the vehicle V3, and then maintain constant speed and switch to the rightmost lane. Alternatively, another possible sequence of actions may be to wait for one second, and then change to the right lane, without accelerating. Depending on the situation, the second alternative may be more or less desirable. For example, on the one hand, accelerating the vehicle V requires more fuel expenditure. On the other hand, by accelerating, the vehicle V may accomplish the desired sequence of actions more quickly, and remain a safer distance away from the truck V3. Accordingly, the first sequence of moves may represent the better path in this scenario. This planning process may be at least somewhat analogous to the planning process for moving a surveillance camera in some embodiments, as discussed in connection with FIGS. 9A, 9B, 10A, 10B, 10C, and 11. Thus, in some embodiments, the path planner may be implemented as part of the object tracker 1202 or some module on the vehicle computer, and the controller to carry out the plan may be implemented using the decision making components 1220 of the vehicle.

In this context, the planner may determine the plan based on another decision graph similar to the graph shown in FIG. 11. In this example, the nodes in the graph may represent different positional states in the road environment. For example, in this context, one node may correspond to the vehicle V traveling in the middle lane and moving at 55 miles per hour. During planning, a search may be conducted over all paths from a current node to one or more desired destination nodes. Each edge of the graph may be associated with a cost for making the move from one state to the other. For example, in some situations, a move that requires a drastic acceleration or moving very close to another vehicle would be associated with a high cost. Based on such a graph, an adapted algorithm based on the algorithm indicated in FIG. 10B may be used to produce the best path in some embodiments. In the adapted algorithm, the path may be generalized to model multiple lane changes. In some embodiments, the planning may occur in stages (e.g., a plan to change two lanes, then make a new plan). In some embodiments, various heuristics may be added to the edge costs. For example, in the illustrated example, missing the exit or making excessively “unsafe” movements may add to the edge costs.

Once the plan is established, the decision making components 1220 may implement the plan by issuing fine-grained movement directives to the vehicle's motion control subsystem 1230. For example, to implement a first step in the plan, a motion selector may cause the vehicle V to veer right while accelerating mildly, for example by an additional 3 miles per hour. In some embodiments, the decision making components 1220 may also precede the lane change by turning on a turn signal for a few seconds.

FIG. 13 is a flow diagram illustrating aspects of operations for controlling the movements of a movable device (e.g. an autonomous vehicle), according to some embodiments. In operations corresponding to element 1310, groups of sensor data, such as video frames captured by sensors on a movable device (e.g. an autonomous vehicle), may be received. Various types of sensors may be used in different embodiments. As discussed, in some embodiments the video data may include video data captured by traditional video cameras, or successive frames of data captured by other types of sensors, such as LIDAR devices, infrared cameras, etc. The sensor data may be received by the vehicle's onboard computer, which may be tasked with making movement decisions for the autonomous vehicle based on analysis of the sensor data.

As indicated in element 1312, an object may be detected in the sensor data. The object may be identified to be a type of object specified in an object database (e.g. a database of road objects) in some embodiments. The road object database may be implemented using, for example, an object-of-interest or region-of-interest (ROI) database 125 and associated database module 126, as discussed in connection with FIG. 8. The detection operation may be performed by the object tracker 1202 and/or Module I discussed previously. In some embodiments, the object tracker 1202 may perform a classification using a neural network to determine the type of the object. The detected object may be a road object such as another vehicle, a pedestrian, a traffic signal, a road sign, etc.

As indicated in element 1314, polygons may be generated for the object in each of the subsequent video frames or other groups of sensor data. This operation may be performed by the object tracker 1202 and/or Module II discussed previously. A given polygon may, for example, be generated such that it encompasses all pixels in a video frame that are determined to contain the road object. In some embodiments, the polygon may be generated using a pixelated heat map.

As indicated in element 1316, portions of the video frames or other sensor data groups within the generated polygons may be monitored using an object analysis technique selected based on the object type. This operation may be performed by a vehicle computer in various embodiments. For example, depending on the determined type of the road object, one of a number of specialized analyzers may be used to continuously analyze the polygons in the successive video frames. These specialized analyzers may include, for example, the analyzers 1212, 1214, 1216, and 1218 discussed in connection with FIG. 12. A detected vehicle on the road may be tracked by a vehicle analyzer to monitor its movements, brake or turn light signals, etc. A traffic light may be tracked by a traffic light analyzer to monitor its state. In some embodiments, the specialized analyzers may be implemented as neural networks, which may be separate from, or a part of, the MCC-DNN discussed previously.

As indicated in element 1318, a determination may be made whether a state change in the object is detected based on the monitoring in various embodiments. Such state changes may be determined by the specialized analyzers (e.g., analyzers 1212, 1214, 1216, and 1218) implemented on the vehicle computer in some embodiments. For example, a vehicle analyzer may detect that the brake lights of a vehicle in front of the autonomous vehicle have turned on. As another example, the vehicle analyzer may detect that the vehicle in front of the autonomous vehicle is too close to the vehicle where the analysis is being performed, given the two vehicles' traveling speeds. As yet another example, a traffic light analyzer may detect that a monitored traffic light has turned from green to yellow. These detections may all comprise a state change of the road object. In some embodiments, detected state changes may cause the analyzers to communicate with a decision making component of the vehicle (e.g. a motion selector) to determine movements for the autonomous vehicle as a result of the state changes, for example, to slow down the vehicle, avoid a collision, etc. In some embodiments, a motion selector or other decision making component may maintain a model of the autonomous vehicle's surroundings. In some embodiments, the motion selector may be implemented at least in part as a downstream neural network from the MCC-DNN.

As indicated in element 1320, one or more motion directives may be provided to a motion control subsystem of the movable device based on the prediction. This operation may be performed by a decision making component such as a motion selector. In some embodiments, the motion directives may be provided as control signals at a specified frequency to the motion control subsystem. The motion directives may include directives to accelerate, decelerate, change direction, etc. In some embodiments, the detected state changes may also cause actions to be performed by other systems on the autonomous vehicle, for example, to flash the headlights or sound the horn, etc.

FIG. 14 is a flow diagram illustrating aspects of operations for movement planning at a movable device (e.g. an autonomous vehicle) using an attention focusing system, according to some embodiments. Operations corresponding to the first three elements of FIG. 14 (elements 1410, 1412, and 1414) may be performed in a similar fashion as the operations discussed above with respect to the first three elements of FIG. 13 (elements 1310, 1312, and 1314) in at least some embodiments.

A prediction of the future movements of detected objects may be determined based at least in part on the previous polygons of the objects (element 1416) in the depicted embodiment. This operation may be performed by the object tracker 1202 and/or Module III discussed previously. For example, the determination may be performed by modeling the past polygons associated with the object and then generating a future polygon for a future frame. In some embodiments, the prediction may generate multiple future polygons for multiple frames or time steps into the future. In some embodiments, such successive future polygons may be computed based in part on the predicted positions of the polygons that have not yet been observed.

In operations corresponding to element 1418, a movement plan may be determined for the movable device (e.g. an autonomous vehicle), for example, from a current position to a desired position relative to the detected objects. The determination of the movement plan may be made based on the predicted movements of the road objects in various embodiments, and the determination may be performed by the object tracker 1202 (or some other module or neural network implemented on the vehicle computer) and/or Module V discussed previously. In some embodiments, the vehicle computer may maintain a spatial model of its surroundings, including the relative positions and movements of the detected road objects. In some embodiments, the vehicle computer may generate a planning tree representing different movement plans available to the autonomous vehicle and the results of the movement plans. For example, in an example planning tree, each edge may correspond to a move by the autonomous vehicle (e.g., to accelerate past another vehicle, or change between lanes, etc.), and the nodes may represent different future positional states of the autonomous vehicle's surroundings (e.g., different positional configurations of the autonomous vehicle relative to the other cars on the road). The vehicle computer may then use the planning tree to make a determination as to the best plan to go from the autonomous vehicle's current position to the desired position. The determination may be made based on a tree search that takes into account respective cost functions associated with the edges. For example, in one scenario, the autonomous vehicle may determine that it needs to move from one lane to another lane to exit a highway. The autonomous vehicle's movement planner may determine that the move may be accomplished by a sequence of smaller movements to safely navigate through the traffic on the highway. In some embodiments, the cost function associated with each smaller movement may take into account factors such as the risk of the move (e.g., how close the move may bring the autonomous vehicle to another car, how much the move would require the autonomous vehicle to speed up or slow down, etc.). In some embodiments, the movement planner may sum the cost functions of each edge of the planning tree to determine a total cost for a plan, and select a movement plan that has the minimum total cost.
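
As an illustration of summing edge costs over a planning tree, the sketch below uses a uniform-cost search (a named substitute; the patent only specifies a tree search that minimizes total edge cost). The `successors` and `goal_test` callables are hypothetical placeholders for the vehicle's model of moves, risks, and target positions.

    import heapq

    def cheapest_movement_plan(start_state, goal_test, successors):
        # Minimal uniform-cost search sketch: `successors(state)` yields (move, edge_cost, next_state)
        # tuples, where edge costs encode factors such as proximity to other vehicles or required
        # acceleration; the returned plan minimizes the summed edge costs.
        frontier = [(0.0, 0, start_state, [])]
        counter = 1                                       # tie-breaker so states are never compared
        while frontier:
            cost, _, state, plan = heapq.heappop(frontier)
            if goal_test(state):
                return plan, cost                         # sequence of moves with minimum total cost
            for move, edge_cost, nxt in successors(state):
                heapq.heappush(frontier, (cost + edge_cost, counter, nxt, plan + [move]))
                counter += 1
        return None, float("inf")                         # no feasible plan found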

In operations corresponding to element 1420, the one or more motion directives may be provided to a motion control subsystem of the movable device (e.g. an autonomous vehicle) to carry out the movement plan. This operation may be performed by a decision component 1220, such as a motion planner or motion selector. In some embodiments, the decision making component 140 may include memory to store intermediate states that are needed to achieve the final desired state of the movement plan. In accordance with the plan, in one embodiment a motion selector may transmit movement directives to the motion control subsystem to take the autonomous vehicle from one intermediate state to the next, until the desired state is achieved. In some embodiments, the vehicle's movement planner may continuously update its movement plan based on additional information from new groups of sensor data, such as newly acquired video frames. In some embodiments, the movement planner may change or abandon a movement plan based on new information. The decision making components may generate motion directives in accordance with the currently selected movement plan, and provide the directives to the motion control subsystem in a manner similar to operation 1320 in FIG. 13. It is noted that in various embodiments, at least some operations other than those illustrated in the flow diagrams of FIG. 4, FIG. 5C, FIG. 10B, FIG. 13 and/or FIG. 14 may be used to implement the sensor data analysis techniques described above. Some of the operations shown may not be implemented in some embodiments, or may be implemented in a different order, or in parallel rather than sequentially.

Other example embodiments are provided below.

In a general example embodiment, a computing system is provided comprising: a dynamic camera device comprising a memory device and one or more processors configured to obtain video data, detect an object in a region of interest (ROI) in the video data using deep neural network computations, generate a polygon within the ROI, and predict a future movement of the polygon; and a computing device in data communication with the camera device and comprising an object of interest database, the computing device configured to receive the polygon and the future movement of the polygon from the camera device, and generate one or more commands to control movement of the dynamic camera device.

In another general example embodiment, a computing system is provided comprising: a communication device configured to receive video data that is transmittable by a video camera; a memory device configured to store the video data and an object of interest database; and one or more processors configured to at least: pre-process the video data, detect an object in a region of interest (ROI) in the video data using deep neural network computations, generate a polygon within the ROI, predict a future movement of the polygon, store the polygon in the database, and use the future movement of the polygon to output one or more commands to control movement of the dynamic camera device.

In another general example embodiment, a video camera system is provided comprising: an image sensor for capturing video data; a memory device configured to store the video data and an object of interest database; and one or more processors configured to at least: pre-process the video data, detect an object in a region of interest (ROI) in the video data using deep neural network computations, generate a polygon within the ROI, predict a future movement of the polygon, store the polygon in the database, and use the future movement of the polygon to output one or more commands to control movement of the dynamic camera device.

In a general example embodiment, a method performed by a computing system is provided, the method comprising: obtaining video data via a dynamic camera device; pre-processing the video data; detecting an object in a region of interest (ROI) in the video data using deep neural network computations; generating a polygon within the ROI; computing a predicted future movement of the polygon; and using the future movement of the polygon to compute and transmit one or more commands to control movement of the dynamic camera device.

In a general example embodiment, a camera system is provided comprising: an image sensor for capturing video data; a memory device for storing the video data as a sequence of video frames at successive time steps; and one or more processors. The one or more processors are configured to at least: generate a sequence of polygons at each of the time steps within a corresponding one of the sequence of video frames. And, for each given polygon at each given time step, the one or more processors: obtain a location and vertices of the given polygon as input into a recurrent polygon network model; output a value from the recurrent polygon network model; and use the multiple outputted values corresponding to different time steps to compute a predicted future movement of the polygon.

In an example aspect, the multiple outputted values comprise recurrence connections in the recurrent polygon network model.

In a general example embodiment, a computing system is provided, comprising: a communication device configured to receive video data that is transmittable by a video camera; a memory device configured to store the video data and an object of interest database; and one or more processors configured to at least generate a sequence of polygons at each of the time steps within a corresponding one of the sequence of video frames. The one or more processors, for each given polygon at each given time step, also: obtain a location and vertices of the given polygon as input into a recurrent polygon network model; output a value from the recurrent polygon network model; and use the multiple outputted values corresponding to different time steps to compute a predicted future movement of the polygon.

In an example aspect, the multiple outputted values comprise recurrence connections in the recurrent polygon network model.

In a general example embodiment, a method performed by a computing system is provided. The method comprising: obtaining video data recorded by a camera device; storing the video data in memory as a sequence of video frames at successive time steps; and generating, via one or more processors, a sequence of polygons at each of the time steps within a corresponding one of the sequence of video frames. For each given polygon at each given time step, the method comprising: obtaining a location and vertices of the given polygon as input into a recurrent polygon network model; outputting a value from the recurrent polygon network model; and using multiple outputted values corresponding to different time steps to compute a predicted future movement of the polygon.

In an example aspect, the multiple outputted values comprise recurrence connections in the recurrent polygon network model.
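
One way to realize the recurrent polygon network model summarized in the embodiments above is a small recurrent network that consumes, at each time step, the polygon's location (centroid) and vertices and regresses the polygon's future displacement. The sketch below assumes PyTorch is available and fixes the vertex count; the module name, layer sizes, and output interpretation are assumptions for illustration rather than the disclosed architecture.

    # Minimal recurrent polygon model sketch (assumes PyTorch is installed).
    import torch
    import torch.nn as nn


    class RecurrentPolygonModel(nn.Module):
        def __init__(self, num_vertices: int = 8, hidden_size: int = 64) -> None:
            super().__init__()
            # Per-time-step input: centroid (2) + flattened vertices (2 * num_vertices).
            self.rnn = nn.GRU(2 + 2 * num_vertices, hidden_size, batch_first=True)
            # Output: predicted displacement (dx, dy) of the polygon.
            self.head = nn.Linear(hidden_size, 2)

        def forward(self, centroids: torch.Tensor, vertices: torch.Tensor) -> torch.Tensor:
            # centroids: (batch, time, 2); vertices: (batch, time, num_vertices, 2)
            features = torch.cat([centroids, vertices.flatten(2)], dim=-1)
            outputs, _ = self.rnn(features)       # recurrence across time steps
            return self.head(outputs[:, -1])      # value from the last time step


    # Example usage with random data: 4 sequences, 5 time steps, 8 vertices each.
    model = RecurrentPolygonModel()
    prediction = model(torch.randn(4, 5, 2), torch.randn(4, 5, 8, 2))
    print(prediction.shape)  # torch.Size([4, 2])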

In a general example embodiment, a camera system is provided comprising: an image sensor for capturing video data; a memory device for storing the video data as video frames; and one or more processors. The one or more processors are configured to at least: pre-process a given video frame to generate a post-processed image; input the post-processed image into a first multi-layer cross-correlation deep neural network (MCC DNN) in order to output a heat map, the heat map comprising a binary image with each pixel having a value representing the likelihood of having an object of interest at that location; multiplicatively gate the post-processed image and the heat map to generate a gated image; input the gated image into a second MCC DNN to generate an MCC DNN layer activation; input the MCC DNN layer activation into a polygon recurrent neural network (RNN) to generate a set of vertices that define a convex polygon; and compute and output the convex polygon located within the given video frame using the set of vertices.

In a general example embodiment, a computing system is provided comprising: a communication device configured to receive video data that is transmittable by a video camera; a memory device configured to store the video data as video frames; and one or more processors. The one or more processors are configured to at least: pre-process a given video frame to generate a post-processed image; input the post-processed image into a first multi-layer cross-correlation deep neural network (MCC DNN) in order to output a heat map, the heat map comprising a binary image with each pixel having a value representing the likelihood of having an object of interest at that location; multiplicatively gate the post-processed image and the heat map to generate a gated image; input the gated image into a second MCC DNN to generate an MCC DNN layer activation; input the MCC DNN layer activation into a polygon recurrent neural network (RNN) to generate a set of vertices that define a convex polygon; and compute and output the convex polygon located within the given video frame using the set of vertices.

In a general example embodiment, a method performed by a computing system is provided. The method comprising: obtaining video data recorded by a camera device; storing the video data in memory as video frames; pre-processing, using one or more processors, a given video frame to generate a post-processed image; inputting the post-processed image into a first multi-layer cross-correlation deep neural network (MCC DNN) in order to output a heat map, the heat map comprising a binary image with each pixel having a value representing the likelihood of having an object of interest at that location; multiplicatively gating the post-processed image and the heat map to generate a gated image; inputting the gated image into a second MCC DNN to generate an MCC DNN layer activation; inputting the MCC DNN layer activation into a polygon recurrent neural network (RNN) to generate a set of vertices that define a convex polygon; and computing and outputting the convex polygon located within the given video frame using the set of vertices.
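
For illustration, the multiplicative gating step in the pipeline above amounts to an element-wise product of the heat map and the post-processed image, which suppresses regions unlikely to contain the object before the second network and the polygon RNN are applied. The NumPy sketch below shows only that ordering; the stubbed functions stand in for the MCC DNNs and polygon RNN and are not the actual models.

    # Illustrative ordering of the described pipeline, with NumPy stubs
    # standing in for the first MCC DNN, second MCC DNN, and polygon RNN.
    import numpy as np


    def first_mcc_dnn(image: np.ndarray) -> np.ndarray:
        # Stand-in heat map: per-pixel likelihood of an object of interest.
        heat = np.zeros_like(image)
        heat[8:24, 8:24] = 1.0
        return heat


    def second_mcc_dnn(gated: np.ndarray) -> np.ndarray:
        # Stand-in layer activation derived from the gated image.
        return gated[::4, ::4]


    def polygon_rnn(activation: np.ndarray) -> list:
        # Stand-in vertex generator for a convex polygon.
        return [(8, 8), (24, 8), (24, 24), (8, 24)]


    post_processed = np.random.rand(32, 32)       # pre-processed video frame
    heat_map = first_mcc_dnn(post_processed)
    gated_image = post_processed * heat_map       # multiplicative gating
    activation = second_mcc_dnn(gated_image)
    vertices = polygon_rnn(activation)
    print(vertices)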

In a general example embodiment, a camera system is provided comprising: an image sensor for capturing video data; one or more actuators to control at least one of pan, tilt, and zoom affecting the image sensor; a memory device for storing the video data; and one or more processors. The one or more processors are configured to at least: analyze the video data to determine a current position of an object of interest and to determine a predicted future movement of the object of interest; use the current position, the future movement and one or more physical constraints of the one or more actuators to plan a path of movement of the camera to point at a future predicted position of the object of interest; and use the path to generate actuator commands to control the camera to point at the future predicted position of the object of interest.

In an example aspect, there are multiple objects of interest in the video data, and the path is computed to traverse multiple future predicted positions of the multiple objects of interest.

In a general example embodiment, a computing system is provided comprising: a communication device configured to receive video data that is transmittable by a video camera; a memory device configured to store the video data; and one or more processors. The one or more processors are configured to at least: analyze the video data to determine a current position of an object of interest and to determine a predicted future movement of the object of interest; use the current position, the future movement and one or more physical constraints of one or more actuators of the video camera to plan a path of movement of the video camera to point at a future predicted position of the object of interest; and use the path to generate actuator commands to control the video camera to point at the future predicted position of the object of interest.

In an example aspect, there are multiple objects of interest in the video data, and the path is computed to traverse multiple future predicted positions of the multiple objects of interest.

In a general example embodiment, a method performed by a computing system is provided. The method comprising: obtaining video data recorded by a camera device; storing the video data in memory; processing the video data using one or more processors to determine a current position of an object of interest and to determine a predicted future movement of the object of interest; using the current position, the future movement and one or more physical constraints of one or more actuators of the camera device to plan a path of movement of the camera to point at a future predicted position of the object of interest; and using the path to generate actuator commands to control the camera to point at the future predicted position of the object of interest.
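
The path-planning step in the embodiments above can be thought of as choosing a sequence of pan/tilt set-points toward the predicted position while honoring actuator limits such as a maximum per-step slew rate. The following is a minimal sketch under those assumptions; the limits, angles, and command format are illustrative only.

    # Hypothetical pan/tilt path planning toward a predicted object
    # position, limited by a maximum per-step slew rate of the actuators.
    from typing import List, Tuple


    def plan_path(current: Tuple[float, float],
                  target: Tuple[float, float],
                  max_step_deg: float = 2.0) -> List[Tuple[float, float]]:
        path: List[Tuple[float, float]] = []
        pan, tilt = current
        while abs(pan - target[0]) > 1e-9 or abs(tilt - target[1]) > 1e-9:
            # Clamp each increment to the actuator's physical constraint.
            pan += max(-max_step_deg, min(max_step_deg, target[0] - pan))
            tilt += max(-max_step_deg, min(max_step_deg, target[1] - tilt))
            path.append((pan, tilt))
        return path


    # Each waypoint becomes an actuator command pointing the camera toward
    # the future predicted position of the object of interest.
    for waypoint in plan_path(current=(0.0, 0.0), target=(10.0, -4.0)):
        print({"pan": waypoint[0], "tilt": waypoint[1]})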

In a general example embodiment, a computer system is provided comprising one or more processors and an associated memory, the memory storing a deep neural network, the deep neural network configured to receive sensor data captured by one or more sensors, the sensor data comprising one or more successive image frames, detect an object in the image frames, generate a plurality of polygons surrounding the object in each of the successive image frames, and generate a prediction of a future position of the object based at least in part on the plurality of polygons, and the one or more processors are further configured to provide one or more commands to a control system based at least in part on the prediction of the future position of the object.

In an example aspect, the computer system is implemented so that the one or more sensors are located on an autonomous vehicle and configured to capture sensor data of a road scene, the control system comprises a motion control subsystem of the autonomous vehicle, and the one or more commands comprise motion directives to the motion control subsystem to control movements of the autonomous vehicle.

In an example aspect, the computer system is implemented such that the one or more sensors include a Light Detection and Ranging (LIDAR) device.

In an example aspect, the computer system is implemented such that the deep neural network is configured to determine a type of the object specified in an object-of-interest database, monitor portions of the image frames in the respective polygons using an object analysis technique selected based on the object type, and detect a state change in the object based on the monitoring, and the one or more motion directives are generated based at least in part on the detection of the state change.
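
For illustration, the type-dependent monitoring described in this aspect can be viewed as dispatching each tracked polygon's image patch to an analysis routine keyed by the detected object type and reporting when its state changes. The sketch below is hypothetical: the analyzer functions and the traffic-signal example state are placeholders rather than the object analysis techniques actually used.

    # Illustrative dispatch of type-specific analysis to detect state changes.
    from typing import Callable, Dict, Optional


    def analyze_traffic_signal(patch) -> str:
        # Stand-in: a real analyzer would classify the signal color from pixels.
        return "red" if sum(patch) % 2 else "green"


    def analyze_pedestrian(patch) -> str:
        return "moving" if max(patch) > 0 else "stationary"


    ANALYZERS: Dict[str, Callable] = {
        "traffic_signal": analyze_traffic_signal,
        "pedestrian": analyze_pedestrian,
    }


    def detect_state_change(object_type: str, patch, previous: Optional[str]) -> Optional[str]:
        analyzer = ANALYZERS.get(object_type)
        if analyzer is None:
            return None
        state = analyzer(patch)
        return state if state != previous else None


    # Example: a monitored traffic signal whose state changes could trigger
    # generation of a motion directive.
    print(detect_state_change("traffic_signal", patch=[1, 2, 4], previous="green"))  # "red"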

In an example aspect, the computer system is implemented such that the deep neural network is configured to determine the type of the object selected from a list comprising a vehicle, a pedestrian, a traffic signal, or a road sign.

In an example aspect, the computer system is implemented such that the deep neural network is configured to generate predictions of respective future movements of a plurality of objects detected in the image frames, and determine a movement plan to move the autonomous vehicle from a current position to a desired position relative to the plurality of objects based at least in part on their predicted future movements, and the one or more motion directives are generated based at least in part on the movement plan.
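
One straightforward way to realize the plan-selection behavior described in this aspect is to score each candidate movement plan with a cost function over the predicted positions of nearby objects and keep the lowest-cost plan. The sketch below is illustrative only; the cost term and the waypoint representation of a plan are assumptions.

    # Hypothetical selection of a movement plan by applying a cost function
    # that penalizes passing close to predicted object positions.
    from typing import Dict, List, Tuple

    Point = Tuple[float, float]


    def plan_cost(plan: List[Point], predicted_objects: List[Point]) -> float:
        cost = 0.0
        for waypoint in plan:
            for obj in predicted_objects:
                dist = ((waypoint[0] - obj[0]) ** 2 + (waypoint[1] - obj[1]) ** 2) ** 0.5
                cost += 1.0 / (dist + 1e-3)       # higher cost when close to an object
        return cost


    predicted_objects: List[Point] = [(5.0, 1.0), (12.0, -2.0)]   # predicted future positions
    candidate_plans: Dict[str, List[Point]] = {
        "keep_lane": [(2.0, 0.0), (6.0, 0.0), (10.0, 0.0)],
        "shift_left": [(2.0, 2.0), (6.0, 3.0), (10.0, 3.0)],
    }
    best = min(candidate_plans,
               key=lambda name: plan_cost(candidate_plans[name], predicted_objects))
    print(best)   # the selected (lowest-cost) movement plan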

In a general example embodiment, a method is provided. The method comprises receiving sensor data captured by one or more sensors, the sensor data comprising one or more successive image frames. The method comprises, using a deep neural network: detecting an object in the image frames, generating a plurality of polygons surrounding the object in each of the successive image frames, and generating a prediction of a future position of the object based at least in part on the plurality of polygons. The method also comprises providing one or more commands to a control system based at least in part on the prediction of the future position of the object.

In a general example embodiment, a non-transitory computer-accessible storage medium storing program instructions is provided. The program instructions, when executed on one or more processors, cause the one or more processors to receive sensor data captured by one or more sensors, the sensor data comprising one or more successive image frames, use a deep neural network to detect an object in the image frames, to generate a plurality of polygons surrounding the object in each of the successive image frames, and to generate a prediction of a future position of the object based at least in part on the plurality of polygons, and provide one or more commands to a control system based at least in part on the prediction of the future position of the object.

It will be appreciated that any module or component exemplified herein that executes instructions or operations may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data, except transitory propagating signals per se. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the systems, devices, and servers described herein, or accessible or connectable thereto. Any application or module herein described may be implemented using computer readable/executable instructions or operations that may be stored or otherwise held by such computer readable media.

It will also be appreciated that the examples and corresponding diagrams used herein are for illustrative purposes only. Different configurations and terminology can be used without departing from the principles expressed herein. For instance, components and modules can be added, deleted, modified, or arranged with differing connections without departing from these principles.

The steps or operations in the computer processes, the flow charts, and the diagrams described herein are provided by way of example only. There may be many variations to these steps or operations without departing from the principles discussed above. For instance, the steps may be performed in a differing order, or steps may be added, deleted, or modified.

Although the above principles have been described with reference to certain specific examples, various modifications thereof will be apparent to those skilled in the art as outlined in the appended claims.

What is claimed is:
1-20. (canceled)
21. A movable device, comprising: one or more sensors configured to capture one or more image frames; one or more processors and an associated memory, the memory storing a neural network configured to: receive the one or more image frames captured by the one or more sensors; detect an object in the one or more image frames, including to: determine a probability that a portion of the object is positioned at a location in one of the one or more image frames; and produce a post-processed image using one or more transformations of the image frame that remove one or more areas of the image frame that do not include the object; generate a plurality of polygons surrounding the object in individual ones of the one or more image frames; generate a prediction of a future position of the object based at least on the plurality of polygons; and generate, based at least on the prediction of the future position of the object, a movement plan of the movable device; and a control system configured to perform one or more commands issued by the one or more processors to execute the movement plan.
22. The movable device of claim 21, wherein the one or more processors are configured to: generate, based at least on the prediction of the future position of the object, a plurality of movement plans; and select the movement plan from the plurality of movement plans based on application of a cost function to individual ones of the plurality of movement plans.
23. The movable device as recited in claim 21, wherein to generate the prediction of the future position of the object, the neural network is configured to: obtain respective centroids and vertices for individual ones of the plurality of polygons; and determine a position of a future polygon in a future image frame, based at least in part on the respective centroids and vertices.
24. The movable device as recited in claim 21, wherein: the control system comprises a controller for a video camera, and the one or more commands instruct the video camera to move or zoom to focus attention on the object.
25. The movable device as recited in claim 21, wherein: the control system comprises a motion control subsystem of a vehicle, and the one or more commands comprise motion directives to the motion control subsystem to control movements of the vehicle.
26. The movable device as recited in claim 21, wherein the one or more commands include a command to accelerate or decelerate the movable device.
27. The movable device as recited in claim 21, wherein the one or more sensors include a Light Detection and Ranging (LIDAR) device.
28. The movable device as recited in claim 21, wherein the one or more processors implement an object tracker configured to detect and track objects of different types in the image frames.
29. The movable device as recited in claim 21, wherein the one or more processors are configured to generate a command to the control system in response to a detected state change in the object.
30. The movable device as recited in claim 21, wherein: the object is another movable device, and the movement plan is generated to avoid a collision between the movable device and the other movable device.
31. A method, comprising: capturing, via one or more sensors of a movable device, one or more image frames; performing, using a neural network implemented on one or more processors and an associated memory on the movable device: receiving the one or more image frames captured by the one or more sensors; detecting an object in the one or more image frames, including: determining a probability that a portion of the object is positioned at a location in one of the one or more image frames; and producing a post-processed image using one or more transformations of the image frame that remove one or more areas of the image frame that do not include the object; generating a plurality of polygons surrounding the object in individual ones of the one or more image frames; generating a prediction of a future position of the object based at least on the plurality of polygons; and generating, based at least on the prediction of the future position of the object, a movement plan of the movable device; and performing, by a control system of the movable device, one or more commands issued by the one or more processors to execute the movement plan.
32. The method as recited in claim 31, wherein: the control system comprises a controller for a video camera, and the one or more commands instruct the video camera to move or zoom to focus attention on the object.
33. The method as recited in claim 31, wherein: the control system comprises a motion control subsystem of a vehicle, and the one or more commands comprise motion directives to the motion control subsystem to control movements of the vehicle.
34. The method as recited in claim 31, wherein the one or more commands include a command to accelerate or decelerate the movable device.
35. The method as recited in claim 31, wherein the one or more sensors include a Light Detection and Ranging (LIDAR) device.
36. The method as recited in claim 31, further comprising tracking, via an object tracker implemented on the one or more processors, a plurality of objects of different types in the image frames.
37. The method as recited in claim 31, further comprising the one or more processors generating a command to the control system in response to a detected state change in the object.
38. The method as recited in claim 31, wherein: the object is another movable device, and the movement plan is generated to avoid a collision between the movable device and the other movable device.
39. A non-transitory computer-accessible storage medium storing program instructions that when executed on one or more processors cause the one or more processors to: receive one or more image frames via one or more sensors of a movable device; use a neural network implemented on the movable device to: detect an object in the one or more image frames, including to: determine a probability that a portion of the object is positioned at a location in one of the one or more image frames; and produce a post-processed image using one or more transformations of the image frame that remove one or more areas of the image frame that do not include the object; generate a plurality of polygons surrounding the object in individual ones of the one or more image frames; generate a prediction of a future position of the object based at least on the plurality of polygons; and generate, based at least on the prediction of the future position of the object, a movement plan of the movable device; and issue one or more commands to a control system of the movable device, wherein the one or more commands cause the control system to execute the movement plan.
40. The non-transitory computer-accessible storage medium as recited in claim 39, wherein: the object is another movable device, and the movement plan is generated to avoid a collision between the movable device and the other movable device.